Performance Bounds for the λ Policy Iteration Algorithm

Abstract: We consider the discrete-time, infinite-horizon, stationary optimal control problem with the
discounted criterion. We study λ Policy Iteration, a family of algorithms parameterized by λ, originally
proposed by Ioffe and Bertsekas. λ Policy Iteration generalizes the standard algorithms Value Iteration and
Policy Iteration, and is closely related to the TD(λ) algorithm proposed by Sutton and Barto. We deepen the
original analysis developed by Ioffe and Bertsekas by describing convergence rate bounds that generalize
standard results on Value Iteration described, for instance, by Puterman. We also develop the analysis of
this algorithm when it is used in an approximate form. In doing so, we extend and unify the analyses
developed separately by Munos for the approximate versions of Value Iteration and Policy Iteration. The main
contribution of this article is to show that using an approximate version of λ Policy Iteration is sound.

Keywords: Optimal control, Reinforcement learning, Analysis of algorithms, Convergence rate, Error bounds,
Markov decision process, Value Iteration, Policy Iteration, Temporal differences, Span seminorm
Introduction

We consider the discrete-time infinite-horizon discounted stationary optimal control problem formalized by
Markov Decision Processes. We study λ Policy Iteration, a family of algorithms parameterized by λ, originally
introduced by Ioffe and Bertsekas [1]. λ Policy Iteration generalizes the standard algorithms Value Iteration
and Policy Iteration, and has some connections with TD(λ) introduced by Sutton & Barto [9]. We deepen the
original theory developed by Ioffe and Bertsekas [1] by providing convergence rate bounds which generalize
standard bounds for Value Iteration described for instance by Puterman [7]. We also develop the theory of
this algorithm when it is used in an approximate form. In doing so, we extend and unify the separate analyses
developed by Munos for Approximate Value Iteration [5] and Approximate Policy Iteration [4]. The main
contribution of this paper is to show that doing Approximate λ Policy Iteration is sound.



1 Definition of norms and seminorms 

The analysis we will describe in this article relies on several norms and seminorms which we define here. 

Let $X$ be a finite space. In this section, $u$ denotes a real-valued function on $X$, which can be seen as a
vector of dimension $|X|$. Let $e$ denote the vector whose components are all 1, and let $\mu$ denote a
distribution on $X$. We consider the weighted $L_p$ norm:

$$\|u\|_{p,\mu} := \Big(\sum_{x \in X} \mu(x)\,|u(x)|^p\Big)^{1/p}.$$

We write $\|\cdot\|_p$ for the unweighted $L_p$ norm (i.e. with uniform distribution $\mu$). $\|\cdot\|_\infty$ is the max-norm:

$$\|u\|_\infty := \max_x |u(x)| = \lim_{p \to \infty} \|u\|_p.$$

We write $\mathrm{span}_\infty[\cdot]$ for the span seminorm (as for instance defined in [7]):

$$\mathrm{span}_\infty[u] := \max_x u(x) - \min_x u(x).$$

It can be seen that

$$\mathrm{span}_\infty[u] = 2 \min_a \|u - a e\|_\infty.$$

It is thus natural to generalize the span seminorm definition as follows:

$$\mathrm{span}_{p,\mu}[u] := 2 \min_a \|u - a e\|_{p,\mu}.$$

It is clear that it is a seminorm (i.e. it is non-negative, it satisfies the triangle inequality and
$\mathrm{span}_{p,\mu}[\alpha u] = |\alpha|\,\mathrm{span}_{p,\mu}[u]$). It is not a norm because it is zero for all constant functions.

The error bounds we will derive in this paper are expressed in terms of some span seminorm. The following
relations

$$\mathrm{span}_p[u] \le 2\|u\|_p \le 2\|u\|_\infty$$
$$\mathrm{span}_{p,\mu}[u] \le 2\|u\|_{p,\mu} \le 2\|u\|_\infty \qquad (1)$$
$$\mathrm{span}_\infty[u] \le 2\|u\|_\infty$$

show how to deduce error bounds involving the (more standard) $L_p$ and max norms. Since the span seminorm
can be zero for non-zero (constant) vectors, there is no relation that would enable us to derive error bounds
in span seminorm from an $L_p$ or a max norm. Bounding an error with the span seminorm is in this sense
stronger, and this constitutes our motivation for using it.
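
As an illustration of these definitions, the following Python sketch computes the weighted $L_p$ norm, the max-norm and the span seminorm of a small vector, and checks the identity $\mathrm{span}_\infty[u] = 2\min_a \|u - ae\|_\infty$ together with the third relation of (1). The function names and the test vector are illustrative choices, not part of the original analysis; the minimizer is taken to be the midrange of $u$.

```python
def weighted_lp_norm(u, mu, p):
    """Weighted L_p norm: (sum_x mu(x) * |u(x)|^p)^(1/p)."""
    return sum(m * abs(x) ** p for x, m in zip(u, mu)) ** (1.0 / p)


def max_norm(u):
    """Max-norm: max_x |u(x)|."""
    return max(abs(x) for x in u)


def span_inf(u):
    """Span seminorm: max_x u(x) - min_x u(x)."""
    return max(u) - min(u)


def span_inf_via_min(u):
    """2 * min_a ||u - a e||_inf, using the midrange of u as minimizer a."""
    a = (max(u) + min(u)) / 2.0
    return 2.0 * max_norm([x - a for x in u])


u = [3.0, -1.0, 0.5]          # illustrative vector
mu = [0.2, 0.5, 0.3]          # illustrative distribution on X
uniform = [1.0 / 3.0] * 3

print(span_inf(u), span_inf_via_min(u))   # both forms of the span seminorm
print(weighted_lp_norm(u, mu, 2), max_norm(u))
print(weighted_lp_norm(u, uniform, 200))  # approaches the max-norm as p grows
```

Adding a constant to every component of `u` leaves `span_inf(u)` unchanged, which is the reason a bound in span seminorm cannot be recovered from a bound in norm.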
2 Markov Decision Processes

This paper is about the discrete-time infinite-horizon discounted stationary optimal control problem, which
we describe now.

2.1 The Stochastic Optimal Control Problem

We consider a discrete-time dynamic system whose state transition depends on a control.
We assume that there is a state space $X$ of finite size $N$. When at state $i$, the control is chosen from a finite
control space $A$. The control $a \in A$ specifies the transition probability $p_{ij}(a)$ to the next state $j$. At
the $k$th iteration, the system is given a reward $\gamma^k r(i,a,j)$ where $r$ is the instantaneous reward function,
and $0 < \gamma < 1$ is a discount factor. The tuple $(X, A, p, r, \gamma)$ is known as a Markov Decision Process [7].

We are interested in stationary deterministic policies, that is functions $\pi : X \to A$ which map states into
controls.¹ Writing $i_k$ for the state at time $k$, the value of policy $\pi$ at state $i$ is defined as the total expected
return while following policy $\pi$ from $i$, that is

$$v^\pi(i) := \lim_{N \to \infty} E_\pi\Big[\sum_{k=0}^{N-1} \gamma^k\, r(i_k, \pi(i_k), i_{k+1}) \,\Big|\, i_0 = i\Big] \qquad (2)$$

where $E_\pi$ denotes the expectation conditional on the fact that the actions are selected with the policy $\pi$.
The optimal value starting from state $i$ is defined as

$$v_*(i) := \max_\pi v^\pi(i).$$

We write $P^\pi$ for the $N \times N$ stochastic matrix whose elements are $p_{ij}(\pi(i))$ and $r^\pi$ for the vector whose
components are $\sum_j p_{ij}(\pi(i))\, r(i,\pi(i),j)$. $v^\pi$ and $v_*$ can be seen as vectors on $X$. It is well-known that
$v^\pi$ solves the following Bellman equation:

$$v^\pi = r^\pi + \gamma P^\pi v^\pi.$$

$v^\pi$ is a fixed point of the linear backup operator $T^\pi v := r^\pi + \gamma P^\pi v$. As $P^\pi$ is a stochastic matrix, its
eigenvalues cannot be greater than 1 in modulus, and consequently $I - \gamma P^\pi$ is invertible. This implies that

$$v^\pi = (I - \gamma P^\pi)^{-1} r^\pi = \sum_{i=0}^{\infty} (\gamma P^\pi)^i r^\pi. \qquad (3)$$

It is also well-known that $v_*$ satisfies the following Bellman equation:

$$v_* = \max_\pi\,(r^\pi + \gamma P^\pi v_*) = \max_\pi T^\pi v_*.$$

$v_*$ is a fixed point of the nonlinear backup operator $Tv := \max_\pi T^\pi v$. Once the optimal value $v_*$ is computed,
deriving an optimal policy is straightforward. For any value vector $v$, we call a greedy policy with respect
to the value $v$ a policy $\pi$ that satisfies:

$$\pi \in \arg\max_{\pi'} T^{\pi'} v$$

or equivalently $T^\pi v = Tv$. We will write, with some abuse of notation,² greedy($v$) for any policy that is greedy
with respect to $v$. The notions of optimal value function and greedy policies are fundamental to optimal
control because of the following property: any policy $\pi_*$ that is greedy with respect to the optimal value is
an optimal policy and its value $v^{\pi_*}$ equals $v_*$.

¹ Restricting our attention to stationary deterministic policies is not a limitation. Indeed, for the optimality criterion to be
defined soon, it can be shown that there exists at least one stationary deterministic policy which is optimal [7].
² There might be several policies that are greedy with respect to some value $v$.

It is well-known that the backup operators $T^\pi$ and $T$ are $\gamma$-contraction mappings with respect to the
max-norm. In what follows we only write what this means for the Bellman operator $T$ but the same holds
for $T^\pi$. Being a $\gamma$-contraction mapping for the max-norm means that for all pairs of vectors $(v,w)$,

$$\|Tv - Tw\|_\infty \le \gamma\,\|v - w\|_\infty.$$

This implies that the corresponding fixed point exists and is unique and that for any initial vector $v_0$,

$$\lim_{k \to \infty} T^k v_0 = v_*. \qquad (4)$$
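
To make the operators concrete, here is a minimal Python sketch of $T^\pi$, $T$ and greedy($v$) on a small MDP. The two-state, two-action MDP (the tables `P` and `R` and the value of `GAMMA`) is a hypothetical example chosen for illustration only; it does not come from the paper. The sketch checks the contraction inequality on one pair of vectors and the identity $T^\pi v = Tv$ for a greedy $\pi$.

```python
GAMMA = 0.9
# Hypothetical MDP. P[a][i][j]: probability of moving from i to j under action a.
P = [[[0.8, 0.2], [0.3, 0.7]],
     [[0.1, 0.9], [0.6, 0.4]]]
# R[a][i][j]: instantaneous reward r(i, a, j).
R = [[[1.0, 0.0], [0.0, 2.0]],
     [[0.0, 1.5], [0.5, 0.0]]]
N_STATES, N_ACTIONS = 2, 2


def q_value(v, i, a):
    """One-step lookahead: sum_j p_ij(a) * (r(i, a, j) + gamma * v(j))."""
    return sum(P[a][i][j] * (R[a][i][j] + GAMMA * v[j]) for j in range(N_STATES))


def bellman_policy(v, pi):
    """Linear backup operator T^pi applied to v."""
    return [q_value(v, i, pi[i]) for i in range(N_STATES)]


def bellman_optimal(v):
    """Nonlinear backup operator T applied to v."""
    return [max(q_value(v, i, a) for a in range(N_ACTIONS)) for i in range(N_STATES)]


def greedy(v):
    """A policy that is greedy with respect to v (ties go to the lowest action index)."""
    return [max(range(N_ACTIONS), key=lambda a: q_value(v, i, a)) for i in range(N_STATES)]


def max_norm(u):
    return max(abs(x) for x in u)


v, w = [0.0, 0.0], [10.0, -3.0]
lhs = max_norm([a - b for a, b in zip(bellman_optimal(v), bellman_optimal(w))])
rhs = GAMMA * max_norm([a - b for a, b in zip(v, w)])
print(lhs, "<=", rhs)                                     # contraction in max-norm
print(bellman_policy(w, greedy(w)) == bellman_optimal(w))  # T^pi w = T w for greedy pi
```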

2.2 Value Iteration 

The Value Iteration algorithms for computing the value of a policy $\pi$ and the value of the optimal policy $\pi_*$
rely on equation (4). Algorithm 1 provides a description of Value Iteration for computing an optimal policy
(replace $T$ by $T^\pi$ in it and one gets Value Iteration for computing the value of policy $\pi$). In this description,
we have introduced a term $\epsilon_k$ which stands for several possible sources of error at each iteration: this error
might be the computer round-off, the fact that we use an approximate architecture for representing $v$, a
stochastic approximation of $P^{\pi_k}$, etc., or a combination of these. In what follows, when we talk about the
Exact version of an algorithm, this means that $\epsilon_k = 0$ for all $k$.

Algorithm 1 Value Iteration
Input: An MDP, an initial value $v_0$
Output: An (approximately) optimal policy
    $k \leftarrow 0$
    repeat
        $v_{k+1} \leftarrow T v_k + \epsilon_{k+1}$    // Update the value
        $k \leftarrow k + 1$
    until some stopping criterion
    Return greedy($v_k$)
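
The following Python sketch runs the Exact version of Algorithm 1 (all error terms set to zero) on a hypothetical two-state MDP; the tables `P`, `R`, the value of `GAMMA` and the tolerance `EPS` are illustrative assumptions. As stopping criterion it uses the max-norm test $\|v_{k+1} - v_k\|_\infty \le \frac{1-\gamma}{2\gamma}\epsilon$, and it returns the policy greedy with respect to $v_k$ together with $v_{k+1}$.

```python
GAMMA = 0.9
# Hypothetical MDP: P[a][i][j] transition probabilities, R[a][i][j] rewards.
P = [[[0.8, 0.2], [0.3, 0.7]],
     [[0.1, 0.9], [0.6, 0.4]]]
R = [[[1.0, 0.0], [0.0, 2.0]],
     [[0.0, 1.5], [0.5, 0.0]]]
N_STATES, N_ACTIONS = 2, 2


def q_value(v, i, a):
    return sum(P[a][i][j] * (R[a][i][j] + GAMMA * v[j]) for j in range(N_STATES))


def bellman_optimal(v):
    return [max(q_value(v, i, a) for a in range(N_ACTIONS)) for i in range(N_STATES)]


def greedy(v):
    return [max(range(N_ACTIONS), key=lambda a: q_value(v, i, a)) for i in range(N_STATES)]


def max_norm(u):
    return max(abs(x) for x in u)


def value_iteration(v0, eps):
    """Exact Value Iteration with the max-norm stopping criterion."""
    threshold = (1.0 - GAMMA) / (2.0 * GAMMA) * eps
    v = list(v0)
    while True:
        v_next = bellman_optimal(v)          # exact update: no error term
        if max_norm([a - b for a, b in zip(v_next, v)]) <= threshold:
            return greedy(v), v_next         # greedy policy w.r.t. v_k, and v_{k+1}
        v = v_next


def policy_value(pi, sweeps=2000):
    """Value of policy pi, approximated by applying T^pi many times."""
    v = [0.0] * N_STATES
    for _ in range(sweeps):
        v = [q_value(v, i, pi[i]) for i in range(N_STATES)]
    return v


EPS = 1e-3
policy, v_last = value_iteration([0.0, 0.0], EPS)
v_policy = policy_value(policy)
print(policy, v_last, v_policy)
```

Replacing `bellman_optimal` by the operator of a fixed policy in the loop would give Value Iteration for evaluating that policy, as noted in the text.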

Properties of Exact Value Iteration. It is well-known that the contraction property induces some
interesting properties for Exact Value Iteration. We have already mentioned that contraction implies the
asymptotic convergence (equation (4)). It can also be inferred that there is at least a linear rate of convergence:
for all reference iteration $k_0$, and for all $k \ge k_0$,

$$\|v_* - v_k\|_\infty \le \gamma^{k-k_0}\,\|v_* - v_{k_0}\|_\infty.$$

Even more interestingly, it is possible to derive a performance bound, that is a bound on the difference
between the real value of a policy produced by the algorithm and the value of the optimal policy $\pi_*$ (see for
instance [7]). Let $\pi_k$ denote the policy that is greedy with respect to $v_{k-1}$. Then, for all reference iteration
$k_0$, and for all $k > k_0$,

$$\|v_* - v^{\pi_k}\|_\infty \le \frac{2\gamma^{k-k_0}}{1-\gamma}\,\|Tv_{k_0} - v_{k_0}\|_\infty = \frac{2\gamma^{k-k_0}}{1-\gamma}\,\|v_{k_0+1} - v_{k_0}\|_\infty.$$

This fact is of considerable importance computationally since it provides a stopping criterion: taking
$k = k_0 + 1$, we see that if $\|v_{k_0+1} - v_{k_0}\|_\infty \le \frac{1-\gamma}{2\gamma}\epsilon$, then $\|v_* - v^{\pi_{k_0+1}}\|_\infty \le \epsilon$.

It is somewhat less known that the Bellman operators $T$ and $T^\pi$ are also contraction mappings with respect
to the $\mathrm{span}_\infty$ seminorm [7]. This means that there exists a variant of the above equation involving the span
seminorm instead of the max-norm. For instance, such a fact provides the following stopping criterion [7]:
Proposition 2.1 (Stopping Condition for Exact Value Iteration [7]).
If at some iteration $k_0$, the difference between two subsequent iterations satisfies

$$\mathrm{span}_\infty[v_{k_0+1} - v_{k_0}] \le \frac{1-\gamma}{\gamma}\,\epsilon$$

then the greedy policy $\pi_{k_0+1}$ with respect to $v_{k_0}$ is $\epsilon$-optimal: $\|v_* - v^{\pi_{k_0+1}}\|_\infty \le \epsilon$.

This latter stopping criterion is better since, from the relation between the span seminorm and the norm
(equation (1)), it is satisfied whenever the former is.
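
The comparison between the two stopping criteria can be observed numerically. The Python sketch below counts how many Exact Value Iteration updates are needed before each criterion fires on a hypothetical two-state MDP (`P`, `R`, `GAMMA` and the tolerance are illustrative assumptions). The span threshold is written as twice the max-norm threshold, which equals $\frac{1-\gamma}{\gamma}\epsilon$.

```python
GAMMA = 0.9
# Hypothetical MDP: P[a][i][j] transition probabilities, R[a][i][j] rewards.
P = [[[0.8, 0.2], [0.3, 0.7]],
     [[0.1, 0.9], [0.6, 0.4]]]
R = [[[1.0, 0.0], [0.0, 2.0]],
     [[0.0, 1.5], [0.5, 0.0]]]
N_STATES, N_ACTIONS = 2, 2


def bellman_optimal(v):
    return [max(sum(P[a][i][j] * (R[a][i][j] + GAMMA * v[j]) for j in range(N_STATES))
                for a in range(N_ACTIONS)) for i in range(N_STATES)]


def max_rule(d, eps):
    """Max-norm criterion: ||d||_inf <= (1 - gamma) / (2 gamma) * eps."""
    return max(abs(x) for x in d) <= (1.0 - GAMMA) / (2.0 * GAMMA) * eps


def span_rule(d, eps):
    """Span criterion: span_inf[d] <= (1 - gamma) / gamma * eps."""
    return max(d) - min(d) <= 2.0 * ((1.0 - GAMMA) / (2.0 * GAMMA) * eps)


def iterations_until(stop_rule, eps, v0):
    """Number of exact updates performed before stop_rule accepts v_{k+1} - v_k."""
    v, k = list(v0), 0
    while True:
        v_next = bellman_optimal(v)
        k += 1
        if stop_rule([a - b for a, b in zip(v_next, v)], eps):
            return k
        v = v_next


n_max = iterations_until(max_rule, 1e-3, [0.0, 0.0])
n_span = iterations_until(span_rule, 1e-3, [0.0, 0.0])
print("max-norm criterion:", n_max, "updates; span criterion:", n_span, "updates")
```

The span criterion can fire much earlier when successive iterates differ mostly by a constant shift, since such a shift does not change the greedy policy.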



Properties of Approximate Value Iteration. When considering large Markov Decision Processes, one
cannot usually implement an exact version of Value Iteration. In such a case $\epsilon_k \neq 0$. In general, the algorithm
no longer converges, but it is possible to study its asymptotic behaviour. The most well-known result
is due to Bertsekas and Tsitsiklis [2]: if the approximation errors are uniformly bounded, $\|\epsilon_k\|_\infty \le \epsilon$, then
the asymptotic difference between the optimal value and the value of the policies $\pi_{k+1}$ greedy with respect to $v_k$ satisfies

$$\limsup_{k \to \infty} \|v_* - v^{\pi_k}\|_\infty \le \frac{2\gamma}{(1-\gamma)^2}\,\epsilon. \qquad (5)$$

Munos has recently argued in [4, 5] that, since most supervised learning algorithms (such as least squares
regression) that are used in practice for approximating each iterate of Value Iteration minimize an empirical
approximation in some $L_p$ norm, it would be more interesting to have an analogue of the above result where
the approximation error $\epsilon$ is expressed in terms of the $L_p$ norm. Munos actually showed how to do this in [5].
The idea is to analyze the componentwise asymptotic behaviour of Approximate Value Iteration, from which
it is rather easy to derive an $L_p$ analysis for any $p$. Write $P_k = P^{\pi_k}$ for the stochastic matrix corresponding to the
policy $\pi_k$ which is greedy with respect to $v_{k-1}$, and $P_*$ for the stochastic matrix corresponding to the (unknown)
optimal policy $\pi_*$.

Lemma 2.2 (Componentwise Asymptotic Performance of Approximate Value Iteration [5]).
The following matrices

$$Q_{kj} := (1-\gamma)(I - \gamma P_k)^{-1} P_k P_{k-1} \cdots P_{j+1}$$
$$Q'_{kj} := (1-\gamma)(I - \gamma P_k)^{-1} (P_*)^{k-j}$$

are stochastic and the asymptotic performance of the policies generated by Approximate Value Iteration
satisfies

$$\limsup_{k \to \infty} v_* - v^{\pi_k} \le \limsup_{k \to \infty} \frac{1}{1-\gamma} \sum_{j=0}^{k-1} \gamma^{k-j}\,[Q_{kj} - Q'_{kj}]\,\epsilon_j.$$

From the above componentwise bound, it is possible³ to derive the following $L_p$ bounds.

³ This result is not explicitly stated by Munos in [5], but using a technique from another of his articles [4], it is straightforward
to derive from Lemma 2.2. The current paper will in any case generalize this result (in Proposition 4.8).
Proposition 2.3 (Asymptotic Performance of Approximate Value Iteration I).
Choose any $p$ and any distribution $\mu$. Consider the notations of Lemma 2.2. If the approximation errors
are uniformly bounded for the following set of norms

$$\forall k \ge j, \quad \|\epsilon_k\|_{p,\,\mu \frac{1}{2}(Q_{jk} + Q'_{jk})} \le \epsilon$$

then the asymptotic performance of policies generated by Value Iteration satisfies

$$\limsup_{k \to \infty} \|v_* - v^{\pi_k}\|_{p,\mu} \le \frac{2\gamma}{(1-\gamma)^2}\,\epsilon.$$

Munos also introduces a concentration coefficient [4, 5]: assume there exists a distribution $\nu$ and a
real number $C(\nu)$ such that

$$C(\nu) := \max_{i,j,a} \frac{p_{ij}(a)}{\nu(j)}. \qquad (6)$$

For instance, if one chooses the uniform law $\nu$, then there always exists such a $C(\nu) \in (1, N)$ where $N$ is
the size of the state space. See [4, 5] for more discussion on this coefficient. This concentration coefficient is
interesting because it makes it possible to derive the following performance bounds on the max-norm of the loss.
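
As a small numerical illustration, the Python sketch below evaluates a concentration coefficient for the uniform law on a hypothetical two-state MDP. It assumes the form $C(\nu) = \max_{i,j,a} p_{ij}(a)/\nu(j)$ used by Munos; the transition table `P` and the function name are illustrative and not taken from the paper.

```python
# Hypothetical transition probabilities P[a][i][j] for 2 actions and 2 states.
P = [[[0.8, 0.2], [0.3, 0.7]],
     [[0.1, 0.9], [0.6, 0.4]]]


def concentration_coefficient(P, nu):
    """Assumed form of the coefficient: max over (i, j, a) of p_ij(a) / nu(j)."""
    n = len(nu)
    return max(P[a][i][j] / nu[j]
               for a in range(len(P)) for i in range(n) for j in range(n))


n_states = 2
uniform = [1.0 / n_states] * n_states
c_uniform = concentration_coefficient(P, uniform)
print(c_uniform)   # N times the largest transition probability
```

With the uniform law the coefficient is $N$ times the largest transition probability, so it stays between 1 and $N$; it is close to $N$ when some transition is nearly deterministic.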

Proposition 2.4 (Asymptotic Performance of Approximate Value Iteration II [5]).
Let $C(\nu)$ be the concentration coefficient defined in equation (6). If the approximation errors are uniformly
bounded

$$\|\epsilon_k\|_{p,\nu} \le \epsilon$$

then the asymptotic performance of the policies generated by Approximate Value Iteration satisfies

$$\limsup_{k \to \infty} \|v_* - v^{\pi_k}\|_\infty \le \frac{2\gamma\,(C(\nu))^{1/p}}{(1-\gamma)^2}\,\epsilon.$$

The main difference between the bounds of Propositions 2.3 and 2.4 and that of Bertsekas and Tsitsiklis
(equation (5)) is that the approximation error $\epsilon$ is controlled by the weighted $L_p$ norm. As
$\lim_{p \to \infty} \|\cdot\|_{p,\nu} \le \|\cdot\|_\infty$ and $(C(\nu))^{1/p} \to 1$ when $p \to \infty$, Munos's results are strictly better.

There is in general no guarantee that Approximate Value Iteration (AVI) converges. AVI is known to converge
for specific approximation architectures known as averagers [5], which include state aggregation. Also,
convergence may just occur experimentally. Suppose $(v_k)$ tends to some $v$. Write $\pi$ for the corresponding
greedy policy. Note also that $(\epsilon_k)$ tends to $v - Tv$, which is known as the Bellman residual. The above bounds
apply, but in this specific case they can be improved by a factor $\frac{1}{1-\gamma}$. It is indeed known (e.g. [10]) that

$$\|v_* - v^\pi\|_\infty \le \frac{2\gamma}{1-\gamma}\,\|v - Tv\|_\infty$$

and, with the same notations as above, Munos derived the analogous better $L_p$ bound [5]:

Corollary 2.5 (Performance of Approximate Value Iteration in case of convergence [5]).
Let $C(\nu)$ be the concentration coefficient defined in equation (6). Suppose $(v_k)$ tends to some $v$. Write $\pi$ for the
corresponding greedy policy. Then

$$\|v_* - v^\pi\|_\infty \le \frac{2\gamma\,(C(\nu))^{1/p}}{1-\gamma}\,\|v - Tv\|_{p,\nu}.$$

Finally, let us mention that in [5], Munos also shows some finer performance bounds (in weighted $L_p$
norm) using some finer concentration coefficients. We will not discuss them in this paper and we refer
the interested reader to [5].
2.3 Policy Iteration

Policy Iteration is an alternative method for computing an optimal policy for an infinite-horizon discounted
Markov Decision Process. This algorithm is based on the following property: if $\pi$ is some policy, then any
policy $\pi'$ that is greedy with respect to the value of $\pi$, i.e. any $\pi'$ satisfying $\pi'$ = greedy($v^\pi$), is better than $\pi$
in the sense that $v^{\pi'} \ge v^\pi$. Policy Iteration exploits this property in order to generate a sequence of policies
with increasing values. It is described in Algorithm 2. Note that we use the analytical form of the value of
a policy given by equation (3). Also, as for Value Iteration, our description includes a potential error term $\epsilon_k$
each time the value of a policy is estimated.

Algorithm 2 Policy Iteration
Input: An MDP, an initial policy $\pi_0$
Output: An (approximately) optimal policy
    $k \leftarrow 0$
    repeat
        $v_k \leftarrow (I - \gamma P^{\pi_k})^{-1} r^{\pi_k} + \epsilon_k$    // Estimate the value of $\pi_k$
        $\pi_{k+1} \leftarrow$ greedy($v_k$)    // Update the policy
        $k \leftarrow k + 1$
    until some stopping criterion
    Return $\pi_k$
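
A minimal Python sketch of the Exact version of Algorithm 2 follows. The MDP tables `P`, `R` and the value of `GAMMA` are hypothetical, the linear system of equation (3) is solved with a small Gauss-Jordan routine written for the example, and the stopping criterion (stop when the greedy policy no longer changes) is one possible choice among others.

```python
GAMMA = 0.9
# Hypothetical MDP: P[a][i][j] transition probabilities, R[a][i][j] rewards.
P = [[[0.8, 0.2], [0.3, 0.7]],
     [[0.1, 0.9], [0.6, 0.4]]]
R = [[[1.0, 0.0], [0.0, 2.0]],
     [[0.0, 1.5], [0.5, 0.0]]]
N_STATES, N_ACTIONS = 2, 2


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def q_value(v, i, a):
    return sum(P[a][i][j] * (R[a][i][j] + GAMMA * v[j]) for j in range(N_STATES))


def greedy(v):
    return [max(range(N_ACTIONS), key=lambda a: q_value(v, i, a)) for i in range(N_STATES)]


def evaluate(pi):
    """Exact policy evaluation: solve (I - gamma P^pi) v = r^pi, as in equation (3)."""
    n = N_STATES
    A = [[(1.0 if i == j else 0.0) - GAMMA * P[pi[i]][i][j] for j in range(n)]
         for i in range(n)]
    b = [sum(P[pi[i]][i][j] * R[pi[i]][i][j] for j in range(n)) for i in range(n)]
    return solve(A, b)


def policy_iteration(pi0, max_iter=100):
    """Exact Policy Iteration; stops when the greedy policy is unchanged."""
    pi, history = list(pi0), []
    for _ in range(max_iter):
        v = evaluate(pi)
        history.append(v)
        new_pi = greedy(v)
        if new_pi == pi:
            break
        pi = new_pi
    return pi, v, history


final_policy, final_value, values = policy_iteration([0, 0])
print(final_policy, final_value, len(values))
```

The list `values` stores the value of each successive policy, which makes the increasing-values property easy to inspect.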

Properties of Exact Policy Iteration. When the state space and the control spaces are finite, it is
well-known that Exact Policy Iteration converges to an optimal policy $\pi_*$ in a finite number of iterations. It is
known that the rate of convergence is at least linear [7]. If the function $v \mapsto P^{\mathrm{greedy}(v)}$ is Lipschitz, then it
can be shown that Policy Iteration has a quadratic convergence [7]. However, we did not find any stopping
condition in the literature that is similar to the one of Proposition 2.1.

Properties of Approximate Policy Iteration. For problems of interest, one usually uses Policy Iteration
in an approximate form, i.e. with $\epsilon_k \neq 0$. Results similar to those we presented for Approximate Value
Iteration exist for Approximate Policy Iteration. As soon as there is some error $\epsilon_k \neq 0$, the algorithm does
not necessarily converge anymore, but there is an analog of equation (5) which is also due to Bertsekas and
Tsitsiklis [2]: if the approximation errors are uniformly bounded ($\forall k$, $\|\epsilon_k\|_\infty \le \epsilon$), then the asymptotic
difference between the optimal value and the value of the policies $\pi_{k+1}$ greedy with respect to $v_k$ satisfies

$$\limsup_{k \to \infty} \|v_* - v^{\pi_k}\|_\infty \le \frac{2\gamma}{(1-\gamma)^2}\,\epsilon. \qquad (7)$$

As for Value Iteration, Munos has extended this result so that one can get bounds involving the $L_p$ norm. He
also showed how to relate the performance analysis to the Bellman residual $v_k - T^{\pi_k} v_k$ that says how well $v_k$
approximates the real value of the policy $\pi_k$; this is interesting when the evaluation step of Approximate Policy
Iteration involves the minimization of this Bellman residual. It is important to note that this Bellman residual
is different from the one we introduced in the previous section (we then considered $v_k - Tv_k = v_k - T^{\pi_{k+1}} v_k$
where $\pi_{k+1}$ is greedy with respect to $v_k$). To avoid confusion, and because it is related to some specific
policy, we will call $v_k - T^{\pi_k} v_k$ the Policy Bellman residual. Munos started by deriving a componentwise
analysis. Write $P_k = P^{\pi_k}$ for the stochastic matrix corresponding to the policy $\pi_k$ which is greedy with respect
to $v_{k-1}$, and $P_*$ for the stochastic matrix corresponding to the (unknown) optimal policy $\pi_*$.
Lemma 2.6 (Componentwise Asymptotic Performance of Approximate Policy Iteration [4]).
The following matrices

$$R_k := (1-\gamma)^2 (I - \gamma P_*)^{-1} P_{k+1} (I - \gamma P_{k+1})^{-1}$$
$$R'_k := (1-\gamma)^2 (I - \gamma P_*)^{-1} \big[P_* + \gamma P_{k+1} (I - \gamma P_{k+1})^{-1} P_k\big]$$
$$R''_k := (1-\gamma)^2 (I - \gamma P_*)^{-1} P_* (I - \gamma P_k)^{-1}$$

are stochastic and the asymptotic performance of the policies generated by Approximate Policy Iteration
satisfies

$$\limsup_{k \to \infty} v_* - v^{\pi_k} \le \frac{\gamma}{(1-\gamma)^2} \limsup_{k \to \infty}\,[R_k - R'_k]\,\epsilon_k$$

$$\limsup_{k \to \infty} v_* - v^{\pi_k} \le \frac{\gamma}{(1-\gamma)^2} \limsup_{k \to \infty}\,[R_k - R''_k]\,(v_k - T^{\pi_k} v_k).$$

As for Value Iteration, the above componentwise bound leads to the following $L_p$ bounds.

Proposition 2.7 (Asymptotic Performance of Approximate Policy Iteration I [4]).
Choose any $p$ and any distribution $\mu$. With the notations of Lemma 2.6, the asymptotic performance of the
policies generated by Approximate Policy Iteration satisfies

$$\limsup_{k \to \infty} \|v_* - v^{\pi_k}\|_{p,\mu} \le \frac{2\gamma}{(1-\gamma)^2} \limsup_{k \to \infty} \|\epsilon_k\|_{p,\,\mu \frac{1}{2}(R_k + R'_k)}$$

$$\limsup_{k \to \infty} \|v_* - v^{\pi_k}\|_{p,\mu} \le \frac{2\gamma}{(1-\gamma)^2} \limsup_{k \to \infty} \|v_k - T^{\pi_k} v_k\|_{p,\,\mu \frac{1}{2}(R_k + R''_k)}.$$

Using the concentration coefficient $C(\nu)$ introduced in the previous section (equation (6)), it is also possible
to show⁴ the following weighted $L_p$ bounds:

Proposition 2.8 (Asymptotic Performance of Approximate Policy Iteration II).
Let $C(\nu)$ be the concentration coefficient defined in equation (6). The asymptotic performance of the policies
generated by Approximate Policy Iteration satisfies

$$\limsup_{k \to \infty} \|v_* - v^{\pi_k}\|_\infty \le \frac{2\gamma\,(C(\nu))^{1/p}}{(1-\gamma)^2} \limsup_{k \to \infty} \|\epsilon_k\|_{p,\nu}$$

$$\limsup_{k \to \infty} \|v_* - v^{\pi_k}\|_\infty \le \frac{2\gamma\,(C(\nu))^{1/p}}{(1-\gamma)^2} \limsup_{k \to \infty} \|v_k - T^{\pi_k} v_k\|_{p,\nu}.$$

Again, the bounds of Propositions 2.7 and 2.8 with respect to the approximation error $\epsilon$ are better than
that of Bertsekas and Tsitsiklis (equation (7)). Compared to the analogous result for Approximate Value Iteration
(Propositions 2.3 and 2.4), where the bound depends on a uniform error bound ($\forall k$, $\|\epsilon_k\| \le \epsilon$), the above
bounds have the nice property that they only depend on asymptotic errors/residuals.

Finally, as for Approximate Value Iteration, a better bound (by a factor $\frac{1}{1-\gamma}$) might be obtained if the
sequence of policies happens to converge. It can be shown (from [4], Remark 4 page 7) that:

⁴ Similarly to footnote 3, this result is not explicitly stated by Munos in [4], but using techniques from another of his articles
[5], it is straightforward to derive from Lemma 2.6. The current paper will in any case generalize this result (in Proposition 4.10).
Corollary 2.9 (Performance of Approximate Policy Iteration in case of convergence).
Let $C(\nu)$ be the concentration coefficient defined in equation (6). If the sequence of policies $(\pi_k)$ converges to
some $\pi$, then

$$\|v_* - v^\pi\|_\infty \le \frac{2\gamma\,(C(\nu))^{1/p}}{1-\gamma} \limsup_{k \to \infty} \|\epsilon_k\|_{p,\nu}$$

$$\|v_* - v^\pi\|_\infty \le \frac{2\gamma\,(C(\nu))^{1/p}}{1-\gamma} \limsup_{k \to \infty} \|v_k - T^{\pi_k} v_k\|_{p,\nu}.$$



2.4 λ Policy Iteration

Value Iteration and Policy Iteration are often considered as unrelated algorithms. Though all the results we
have emphasized so far are strongly related (and even sometimes identical), they were proved independently
for each algorithm. In this section, we describe a family of algorithms called "λ Policy Iteration" (originally
introduced in [1]) parameterized by a coefficient $\lambda \in (0, 1)$, that generalizes them both. When $\lambda = 0$, λ Policy
Iteration will reduce to Value Iteration while it will reduce to Policy Iteration when $\lambda = 1$. We will also
briefly discuss the fact that λ Policy Iteration draws some connections with Temporal Difference algorithms
which one finds in the Reinforcement Learning literature [9].

We begin by giving some intuition about how one can make a connection between Value Iteration and
Policy Iteration. For the moment let us forget about the error term $\epsilon_k$. Value Iteration seems to build a
sequence of value functions and Policy Iteration a sequence of policies. Both algorithms can in fact be seen
as updating a sequence of value-policy pairs. With a little rewriting (decomposing the nonlinear
Bellman operator $T$ into 1) the maximization step and 2) the application of the linear Bellman operator),
it can be seen that each iterate of Value Iteration is equivalent to the two following updates:

$$\begin{cases} \pi_{k+1} \leftarrow \mathrm{greedy}(v_k) \\ v_{k+1} \leftarrow T^{\pi_{k+1}} v_k \end{cases} \quad\Longleftrightarrow\quad \begin{cases} \pi_{k+1} \leftarrow \mathrm{greedy}(v_k) \\ v_{k+1} \leftarrow r^{\pi_{k+1}} + \gamma P^{\pi_{k+1}} v_k. \end{cases}$$

The left hand side of the above equation uses the operator $T^{\pi_{k+1}}$ while the right hand side uses its definition.
Similarly, by inverting in Algorithm 2 the order of 1) the estimation of the value of the current policy and
2) the update of the policy, and by using the fact that the value of policy $\pi_{k+1}$ is the fixed point of $T^{\pi_{k+1}}$
(equation (4)), it can be argued that every iteration of Policy Iteration does the following:

$$\begin{cases} \pi_{k+1} \leftarrow \mathrm{greedy}(v_k) \\ v_{k+1} \leftarrow (T^{\pi_{k+1}})^\infty v_k \end{cases} \quad\Longleftrightarrow\quad \begin{cases} \pi_{k+1} \leftarrow \mathrm{greedy}(v_k) \\ v_{k+1} \leftarrow (I - \gamma P^{\pi_{k+1}})^{-1} r^{\pi_{k+1}}. \end{cases}$$

Thanks to this rewriting, both algorithms now look close to each other. Both can be seen as having
an estimate $v_k$ of the value of policy $\pi_k$, from which they deduce a potentially better policy $\pi_{k+1}$. The
corresponding value $v^{\pi_{k+1}}$ of this better policy may be regarded as a target which is going to be tracked by
the next estimate $v_{k+1}$. The difference is in the update that leads from $v_k$ to $v_{k+1}$: while Policy
Iteration directly jumps to the value of $\pi_{k+1}$ (by applying the Bellman operator $T^{\pi_{k+1}}$ an infinite number of
times), Value Iteration only makes one step towards it (by applying $T^{\pi_{k+1}}$ only once). From this common
view of Value Iteration and Policy Iteration, it would be possible to naturally introduce the well-known
Modified Policy Iteration algorithm (see e.g. [7]) which makes $n$ steps at each update:

$$\begin{cases} \pi_{k+1} \leftarrow \mathrm{greedy}(v_k) \\ v_{k+1} \leftarrow (T^{\pi_{k+1}})^n v_k \end{cases} \quad\Longleftrightarrow\quad \begin{cases} \pi_{k+1} \leftarrow \mathrm{greedy}(v_k) \\ v_{k+1} \leftarrow \big[I + \gamma P^{\pi_{k+1}} + \cdots + (\gamma P^{\pi_{k+1}})^{n-1}\big] r^{\pi_{k+1}} + (\gamma P^{\pi_{k+1}})^n v_k. \end{cases}$$

The above common view is interesting here because it leads to a natural introduction of λ Policy
Iteration. λ Policy Iteration makes a $\lambda$-adjustable step towards the value of $\pi_{k+1}$:

$$\begin{cases} \pi_{k+1} \leftarrow \mathrm{greedy}(v_k) \\ v_{k+1} \leftarrow (1-\lambda) \sum_{j=0}^{\infty} \lambda^j (T^{\pi_{k+1}})^{j+1} v_k \end{cases} \quad\Longleftrightarrow\quad \begin{cases} \pi_{k+1} \leftarrow \mathrm{greedy}(v_k) \\ v_{k+1} \leftarrow (I - \lambda\gamma P^{\pi_{k+1}})^{-1} \big(r^{\pi_{k+1}} + (1-\lambda)\gamma P^{\pi_{k+1}} v_k\big). \end{cases}$$
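More formally, λ Policy Iteration (see the above left hand side) consists in doing a $\lambda$-geometric average of
the different numbers of applications of the Bellman operator $(T^{\pi_{k+1}})^j$ to $v_k$. The right hand side is
interesting here because it clearly shows that λ Policy Iteration generalizes Value Iteration (when $\lambda = 0$) and
Policy Iteration (when $\lambda = 1$). The actual equivalence between the left and the right can be proved as
follows:

$$\begin{aligned}
(1-\lambda) \sum_{j=0}^{\infty} \lambda^j (T^{\pi_{k+1}})^{j+1} v_k
&= (1-\lambda) \sum_{j=0}^{\infty} \lambda^j \Big[ \sum_{i=0}^{j} (\gamma P^{\pi_{k+1}})^i r^{\pi_{k+1}} + (\gamma P^{\pi_{k+1}})^{j+1} v_k \Big] \\
&= \sum_{i=0}^{\infty} \sum_{j=i}^{\infty} (1-\lambda)\lambda^j (\gamma P^{\pi_{k+1}})^i r^{\pi_{k+1}} + (1-\lambda) \sum_{j=0}^{\infty} \lambda^j (\gamma P^{\pi_{k+1}})^{j+1} v_k \\
&= \sum_{i=0}^{\infty} \sum_{j=i}^{\infty} \Big[ \lambda^j (\gamma P^{\pi_{k+1}})^i r^{\pi_{k+1}} - \lambda^{j+1} (\gamma P^{\pi_{k+1}})^i r^{\pi_{k+1}} \Big] + (1-\lambda) \sum_{j=0}^{\infty} \lambda^j (\gamma P^{\pi_{k+1}})^{j+1} v_k \\
&= \sum_{i=0}^{\infty} \lambda^i (\gamma P^{\pi_{k+1}})^i r^{\pi_{k+1}} + (1-\lambda) \sum_{j=0}^{\infty} \lambda^j (\gamma P^{\pi_{k+1}})^{j+1} v_k \\
&= \sum_{i=0}^{\infty} (\lambda\gamma P^{\pi_{k+1}})^i \big( r^{\pi_{k+1}} + (1-\lambda)\gamma P^{\pi_{k+1}} v_k \big) \\
&= (I - \lambda\gamma P^{\pi_{k+1}})^{-1} \big( r^{\pi_{k+1}} + (1-\lambda)\gamma P^{\pi_{k+1}} v_k \big).
\end{aligned}$$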
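
The equivalence can also be checked numerically. The Python sketch below compares a truncated $\lambda$-geometric average of the iterates $(T^{\pi})^{j+1} v$ with the closed form $(I - \lambda\gamma P^{\pi})^{-1}(r^{\pi} + (1-\lambda)\gamma P^{\pi} v)$ on a hypothetical two-state MDP. The tables `P`, `R`, the values of `GAMMA`, `lam`, `v`, `pi`, the truncation length and the small linear solver are all illustrative assumptions.

```python
GAMMA = 0.9
# Hypothetical MDP: P[a][i][j] transition probabilities, R[a][i][j] rewards.
P = [[[0.8, 0.2], [0.3, 0.7]],
     [[0.1, 0.9], [0.6, 0.4]]]
R = [[[1.0, 0.0], [0.0, 2.0]],
     [[0.0, 1.5], [0.5, 0.0]]]
N_STATES = 2


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def t_pi(v, pi):
    """Linear backup operator T^pi."""
    return [sum(P[pi[i]][i][j] * (R[pi[i]][i][j] + GAMMA * v[j]) for j in range(N_STATES))
            for i in range(N_STATES)]


def geometric_average(v, pi, lam, terms=400):
    """Truncated (1 - lam) * sum_j lam^j (T^pi)^(j+1) v."""
    total, w = [0.0] * N_STATES, list(v)
    for j in range(terms):
        w = t_pi(w, pi)                       # w = (T^pi)^(j+1) v
        total = [t + (1.0 - lam) * lam ** j * x for t, x in zip(total, w)]
    return total


def closed_form(v, pi, lam):
    """(I - lam*gamma*P^pi)^(-1) (r^pi + (1 - lam)*gamma*P^pi v)."""
    n = N_STATES
    A = [[(1.0 if i == j else 0.0) - lam * GAMMA * P[pi[i]][i][j] for j in range(n)]
         for i in range(n)]
    b = [sum(P[pi[i]][i][j] * (R[pi[i]][i][j] + (1.0 - lam) * GAMMA * v[j])
             for j in range(n)) for i in range(n)]
    return solve(A, b)


v, pi, lam = [2.0, -1.0], [1, 0], 0.6
lhs = geometric_average(v, pi, lam)
rhs = closed_form(v, pi, lam)
print(lhs, rhs)
```

Setting `lam` to 0 in `closed_form` returns $T^\pi v$ (one Value Iteration step), and setting it to 1 returns the value of the policy (one Policy Iteration step).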



In order to describe the λ Policy Iteration algorithm, it is useful to introduce a new operator. For any
value $v$ and any policy $\pi$, define (the following four formulations are equivalent up to some simple linear algebra
manipulations):

$$\begin{aligned}
T_\lambda^\pi v &:= v + (I - \lambda\gamma P^\pi)^{-1} (T^\pi v - v) && (8) \\
&= (I - \lambda\gamma P^\pi)^{-1} (T^\pi v - \lambda\gamma P^\pi v) \\
&= (I - \lambda\gamma P^\pi)^{-1} (r^\pi + (1-\lambda)\gamma P^\pi v) && (9) \\
&= (I - \lambda\gamma P^\pi)^{-1} (\lambda r^\pi + (1-\lambda) T^\pi v) && (10)
\end{aligned}$$

λ Policy Iteration is formally described in Algorithm 3. Once again, our description includes a potential
error term each time the value is updated. Even with this error term, it is straightforward to see that the
algorithm reduces to Value Iteration (Algorithm 1) when $\lambda = 0$ and to Policy Iteration⁵ (Algorithm 2) when
$\lambda = 1$.

Algorithm 3 λ Policy Iteration
Input: An MDP, $\lambda \in (0, 1)$, an initial value $v_0$
Output: An (approximately) optimal policy
    $k \leftarrow 0$
    repeat
        $\pi_{k+1} \leftarrow$ greedy($v_k$)    // Update the policy
        $v_{k+1} \leftarrow T_\lambda^{\pi_{k+1}} v_k + \epsilon_{k+1}$    // Update the estimate of the value of policy $\pi_{k+1}$
        $k \leftarrow k + 1$
    until some convergence criterion
    Return greedy($v_k$)
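
A minimal Python sketch of the Exact version of Algorithm 3 is given below, under several assumptions that are not in the original text: the MDP (`P`, `R`, `GAMMA`) is a hypothetical two-state example, the operator $T_\lambda^\pi$ is computed from equation (9) by the fixed-point iteration $x \leftarrow r^\pi + (1-\lambda)\gamma P^\pi v + \lambda\gamma P^\pi x$ rather than by a matrix inversion, and the convergence criterion is a small max-norm change between successive values.

```python
GAMMA = 0.9
# Hypothetical MDP: P[a][i][j] transition probabilities, R[a][i][j] rewards.
P = [[[0.8, 0.2], [0.3, 0.7]],
     [[0.1, 0.9], [0.6, 0.4]]]
R = [[[1.0, 0.0], [0.0, 2.0]],
     [[0.0, 1.5], [0.5, 0.0]]]
N_STATES, N_ACTIONS = 2, 2


def q_value(v, i, a):
    return sum(P[a][i][j] * (R[a][i][j] + GAMMA * v[j]) for j in range(N_STATES))


def greedy(v):
    return [max(range(N_ACTIONS), key=lambda a: q_value(v, i, a)) for i in range(N_STATES)]


def t_lambda(v, pi, lam, tol=1e-13, max_sweeps=5000):
    """T_lambda^pi v from equation (9), solved by fixed-point iteration."""
    n = N_STATES
    # base = r^pi + (1 - lam) * gamma * P^pi v
    base = [sum(P[pi[i]][i][j] * (R[pi[i]][i][j] + (1.0 - lam) * GAMMA * v[j])
                for j in range(n)) for i in range(n)]
    x = list(v)
    for _ in range(max_sweeps):
        x_next = [base[i] + lam * GAMMA * sum(P[pi[i]][i][j] * x[j] for j in range(n))
                  for i in range(n)]
        if max(abs(a - b) for a, b in zip(x_next, x)) <= tol:
            return x_next
        x = x_next
    return x


def lambda_policy_iteration(v0, lam, tol=1e-10, max_iter=10000):
    """Exact lambda Policy Iteration; returns (policy, value, number of iterations)."""
    v = list(v0)
    for k in range(max_iter):
        pi = greedy(v)                       # update the policy
        v_next = t_lambda(v, pi, lam)        # update the value estimate
        if max(abs(a - b) for a, b in zip(v_next, v)) <= tol:
            return greedy(v_next), v_next, k + 1
        v = v_next
    return greedy(v), v, max_iter


results = {lam: lambda_policy_iteration([0.0, 0.0], lam) for lam in (0.0, 0.5, 1.0)}
for lam, (policy, value, n_iter) in results.items():
    print(lam, policy, value, n_iter)
```

With `lam = 0.0` each outer iteration is one Value Iteration update, and with `lam = 1.0` each outer iteration fully evaluates the greedy policy as in Policy Iteration; intermediate values trade the number of outer iterations against the cost of the inner solve.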

Relation with Reinforcement Learning. The definition of the operator $T_\lambda^\pi$ given by equation (9) is the
form we have used for the introduction of λ Policy Iteration as an intermediate algorithm between Value
Iteration and Policy Iteration. The equivalent form given by equation (8) can be used to make a connection
with the TD(λ) algorithm⁶ that one finds in the Reinforcement Learning literature, of which a reference is
the book by Sutton & Barto [9]. Indeed, equation (8) can be seen as an incremental additive procedure:

$$v_{k+1} \leftarrow v_k + \Delta_k$$

where

$$\Delta_k := (I - \lambda\gamma P^{\pi_{k+1}})^{-1} (T^{\pi_{k+1}} v_k - v_k)$$

is zero if and only if the value $v_k$ is equal to the optimal value $v_*$. It can be shown (see [1] for a proof or
simply look at the equivalence between equations (2) and (3) for the intuition) that the vector $\Delta_k$ has components
given by:

$$\Delta_k(i) = \lim_{N \to \infty} E_{\pi_{k+1}}\Big[\sum_{j=0}^{N-1} (\lambda\gamma)^j\, \delta_k(i_j, i_{j+1}) \,\Big|\, i_0 = i\Big]$$

with

$$\delta_k(i,j) := r(i, \pi_{k+1}(i), j) + \gamma v_k(j) - v_k(i)$$

being the temporal difference associated to transition $i \to j$, as introduced by Sutton & Barto [9]. When one
uses a stochastic approximation of λ Policy Iteration, that is when the expectation $E_{\pi_{k+1}}$ is approximated
by sampling, λ Policy Iteration reduces to the algorithm TD(λ) which is described in chapter 7 of [9]. Also,
Bertsekas and Ioffe [1] showed that Approximate TD(λ) with a linear feature architecture, as described in
chapter 8.2 of [9], corresponds to Approximate λ Policy Iteration where the value is updated by least squares
fitting using a gradient-type iteration after each sample. Finally, the interested reader might notice that
the "unified view" of Reinforcement Learning algorithms which is depicted in chapter 10.1 of [9] is in fact a
picture of λ Policy Iteration.

Properties of Exact λ Policy Iteration. In the original article introducing λ Policy Iteration [1],
Bertsekas and Ioffe provide an analysis of the algorithm when it is run exactly (when $\epsilon_k = 0$). Define the following
factor

$$\beta := \frac{(1-\lambda)\gamma}{1-\lambda\gamma}. \qquad (11)$$

We have $0 \le \beta \le \gamma < 1$. If $\lambda = 0$ (Value Iteration) then $\beta = \gamma$, and if $\lambda = 1$ (Policy Iteration) then $\beta = 0$.
They give some insight on what happens at each iteration. For all $k$, define the operator

$$\forall v, \quad M_k v := (1-\lambda)\, T^{\pi_{k+1}} v_k + \lambda\, T^{\pi_{k+1}} v. \qquad (12)$$

Assume $T^{\pi_{k+1}}$ is a contraction mapping of modulus $\alpha$ for some norm $\|\cdot\|$; this assumption is always true with
$\alpha = \gamma$ and the max-norm. Then [1]

• $M_k$ is a contraction mapping of modulus $\lambda\alpha$ for the same norm $\|\cdot\|$, since $M_k v - M_k w = \lambda\,(T^{\pi_{k+1}} v - T^{\pi_{k+1}} w)$.

• The next iterate $v_{k+1}$ is the (unique) fixed point of $M_k$.

⁵ Policy Iteration starts with an initial policy while λ Policy Iteration starts with some initial value. To be precise, 1 Policy
Iteration starting with $v_0$ is equivalent to Policy Iteration starting with the greedy policy with respect to $v_0$.

⁶ TD stands for Temporal Difference. The connection with TD algorithms was one of the motivations of the original article
about λ Policy Iteration by Bertsekas and Ioffe [1]. Indeed, λ Policy Iteration is there also called "Temporal Difference Based
Policy Iteration" and the presentation the authors give starts from the formulation of equation (8) (which is close to TD(λ)) and
then makes the connection with Value Iteration and Policy Iteration.

The operator $M_k$ gives some insight on how to concretely implement one iteration of λ Policy Iteration: it
can for instance be done through a Value Iteration-like algorithm which applies $M_k$ iteratively. Also, the
fact that its contraction factor can be tuned through $\lambda$ is of particular importance because finding the
corresponding fixed point can be much easier [1] than that of $T^{\pi_{k+1}}$, which is only $\gamma$-contracting.

Bertsekas and Ioffe also show the convergence and provide an asymptotic rate of convergence:

Proposition 2.10 (Convergence and Rate of Convergence of Exact λ Policy Iteration [1]).
If the discount factor $\gamma < 1$, then $v_k$ converges to $v_*$. Furthermore, after some index $k_*$, the rate of
convergence is linear in $\beta$ as defined in equation (11), that is

$$\forall k \ge k_*, \quad \|v_{k+1} - v_*\| \le \beta\,\|v_k - v_*\|.$$

By making $\lambda$ close to 1, $\beta$ can be arbitrarily close to 0 so the above rate of convergence might look
impressive. This needs to be put into perspective: the index $k_*$ is the index after which the policy $\pi_k$ does
not change anymore (and is equal to the optimal policy $\pi_*$). As we said when we introduced the algorithm,
$\lambda$ controls the speed at which one wants $v_k$ to "track the target" $v^{\pi_{k+1}}$; when $\lambda = 1$, this is done in one
step (and if $\pi_{k+1} = \pi_*$ then $v_{k+1} = v_*$). However, the bigger the value of $\lambda$, the less the operator $M_k$ is
contracting (recall that $M_k v$ depends on $v$ only through the term $\lambda T^{\pi_{k+1}} v$ in equation (12)), and the more
time it might take to compute the next iterate $v_{k+1}$ (its fixed point).

Analyses are usually simpler for Value Iteration than for Policy Iteration. Analyses of Value Iteration
are based on the fact that it is an algorithm that computes the fixed point of the Bellman operator, which is
a $\gamma$-contraction mapping in max-norm (see e.g. [2]). Unfortunately, it can be shown that the operator by
which Policy Iteration updates the value from one iteration to the next is in general not a contraction in
max-norm; analyses of Policy Iteration rely on some other properties, like the fact that the sequence of values
is (approximately) non-decreasing (see [2, 4]). In fact, this observation can be drawn for λ Policy Iteration
as soon as it does not reduce to Value Iteration:

Proposition 2.11. As soon as A > 0, there exists no norm for which the operator by which A Policy Iteration 
updates the value from one iteration to the next is a contraction. 

Proof. To see this, consider the following deterministic MDP with two states {1, 2} and two actions {change, stay}: $r_1 = 0$, $r_2 = 1$, $P_{change}(s_2|s_1) = P_{change}(s_1|s_2) = P_{stay}(s_1|s_1) = P_{stay}(s_2|s_2) = 1$. Consider the following two value functions $v = (\epsilon, 0)$ and $v' = (0, \epsilon)$ with $\epsilon > 0$. Their corresponding greedy policies are π = (stay, change) and π' = (change, stay). Then, we can compute the next iterates of $v$ and $v'$ (using equation (9)):

$$T_\lambda v = (I - \lambda\gamma P^{\pi})^{-1}\left(r^{\pi} + (1-\lambda)\gamma P^{\pi} v\right) = \left(\frac{(1-\lambda)\gamma\epsilon}{1-\lambda\gamma},\; 1 + \frac{(1-\lambda)\gamma\epsilon}{1-\lambda\gamma}\right)$$

$$T_\lambda v' = (I - \lambda\gamma P^{\pi'})^{-1}\left(r^{\pi'} + (1-\lambda)\gamma P^{\pi'} v'\right) = \left(\frac{\lambda\gamma + (1-\lambda)\gamma\epsilon}{1-\lambda\gamma},\; \frac{1 + (1-\lambda)\gamma\epsilon}{1-\lambda\gamma}\right)$$

Then

$$T_\lambda v' - T_\lambda v = \frac{\lambda\gamma}{1-\lambda\gamma}\,(1, 1), \qquad v' - v = (-\epsilon, \epsilon).$$

As all norms are equivalent, and as ε can be arbitrarily small, the norm of $T_\lambda v - T_\lambda v'$ can be arbitrarily larger than the norm of $v - v'$ when λ > 0. □

Approximate λ Policy Iteration. In their original article [1], Bertsekas and Ioffe describe a case study involving an instance of Approximate λ Policy Iteration (with $\epsilon_k \neq 0$). However, to the best of our knowledge, there is no work studying the theoretical soundness of doing Approximate λ Policy Iteration.
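The two-state counterexample used in the proof of Proposition 2.11 can be checked numerically. The sketch below assumes that the exact update is $(I - \lambda\gamma P^{\pi})^{-1}(r^{\pi} + (1-\lambda)\gamma P^{\pi} v)$ with π greedy with respect to $v$ (the form used in Appendix B.2); the helper names `solve2`, `lambda_update` and `expansion_ratio` are hypothetical.

```python
def solve2(a, b):
    """Solve the 2x2 linear system a x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [
        (b[0] * a[1][1] - a[0][1] * b[1]) / det,
        (a[0][0] * b[1] - b[0] * a[1][0]) / det,
    ]


def lambda_update(v, gamma, lam):
    """Assumed exact update on the two-state MDP with rewards (0, 1)."""
    reward = [0.0, 1.0]
    # Rewards depend on the current state only, so the greedy action simply
    # moves to the state with the larger value (ties: stay).
    nxt = [max((s, 1 - s), key=lambda t: v[t]) for s in range(2)]
    a = [[(1.0 if i == j else 0.0) - (lam * gamma if nxt[i] == j else 0.0)
          for j in range(2)] for i in range(2)]
    b = [reward[s] + (1 - lam) * gamma * v[nxt[s]] for s in range(2)]
    return solve2(a, b)


def sup_norm(x, y):
    """Max-norm distance between two vectors."""
    return max(abs(p - q) for p, q in zip(x, y))


def expansion_ratio(eps, gamma=0.9, lam=0.5):
    """Ratio of the distance after the update to the distance before."""
    v, v_prime = [eps, 0.0], [0.0, eps]
    gap = sup_norm(lambda_update(v, gamma, lam), lambda_update(v_prime, gamma, lam))
    return gap / sup_norm(v, v_prime)


for eps in (1e-1, 1e-2, 1e-3):
    print(eps, expansion_ratio(eps))
```

The printed ratio grows like $\frac{\lambda\gamma}{(1-\lambda\gamma)\epsilon}$, so it exceeds any fixed constant as ε shrinks, whereas with `lam=0.0` (Value Iteration) it stays below γ.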

2.5 Our contributions

Now that we have described the algorithms and some of their known properties, motivating the remainder of this paper is straightforward. λ Policy Iteration is conceptually a very nice algorithm since it generalizes the two best-known algorithms for solving discounted infinite-horizon Markov Decision Processes. The natural question that arises is whether one can generalize the results we have described so far to λ Policy Iteration (uniformly for all λ). The answer is yes:

• We provide a componentwise analysis of Exact and Approximate λ Policy Iteration.

• We show that the convergence rate of Exact λ Policy Iteration is linear (Proposition 2.10 only showed the asymptotic linear convergence) and we generalize the stopping criterion described for Value Iteration (Proposition 2.1).

• We give componentwise and span seminorm bounds of the asymptotic error of Approximate λ Policy Iteration with respect to the asymptotic approximation error, Bellman residual, and Policy Bellman residual. This generalizes Lemmas 2.2 and 2.6 and Propositions 2.3, 2.4, 2.7 and 2.8. Our analysis actually implies that doing Approximate λ Policy Iteration is sound (when the approximation error tends to 0, the algorithm finds the optimal solution).

• We provide specific (better) bounds for the case when the value or the policy converges, which generalizes Corollaries 2.5 and 2.9.

• Last but not least, we provide all our results using the span seminorms we have introduced earlier, and using the relations between these span seminorms and the standard $L_p$ norms (equation (1)), it can be seen that our results are slightly stronger than all the previously described results.

Conceptually, we provide a unified vision (unified proofs, unified results) for all the mentioned algorithms.

3 Componentwise Performance bounds for λ Policy Iteration

This section contains our main results, which take the form of componentwise performance bounds for λ Policy Iteration. For clarity, most proofs are deferred to the appendix. The core of our work is the complete analysis of λ Policy Iteration (Appendix B). It serves as a basis for computing the rate of convergence of Exact λ Policy Iteration (Section 3.1, proof in Appendix C) and the asymptotic performance of Approximate λ Policy Iteration with respect to the approximation error (Section 3.2, proof in Appendix D). The asymptotic performance of Approximate λ Policy Iteration with respect to the Bellman residuals is somewhat simpler and is proved independently (Section 3.2, proof in Appendix A).
Notations. For clarity, we use the following lighter notations: $P_k := P^{\pi_k}$, $T_k := T^{\pi_k}$, $P_* := P^{\pi_*}$. We will refer to the factor β as introduced by Bertsekas and Ioffe (equation (11)). Also, the following stochastic matrix will play a recurrent role in our analysis:

$$A_k := (1-\lambda\gamma)(I - \lambda\gamma P_k)^{-1} P_k.$$

3.1 Performance bounds for Exact λ Policy Iteration

Consider Exact λ Policy Iteration. We have the following convergence rate bounds:

Lemma 3.1 (Componentwise Rate of Convergence bounds for Exact λ Policy Iteration).
The following matrices

$$E_{kk_0} := (1-\gamma)(P_*)^{k-k_0}(I-\gamma P_*)^{-1}$$

$$E'_{kk_0} := \frac{1-\gamma}{\gamma^{k-k_0}}\left(\frac{\lambda\gamma}{1-\lambda\gamma}\sum_{j=k_0}^{k-1}\gamma^{k-1-j}\beta^{j-k_0}(P_*)^{k-1-j}A_{j+1}A_j\cdots A_{k_0+1} + \beta^{k-k_0}(I-\gamma P_k)^{-1}A_kA_{k-1}\cdots A_{k_0+1}\right)$$

$$F_{kk_0} := (1-\gamma)(P_*)^{k-k_0} + \gamma E'_{kk_0}P_*$$

are stochastic and

$$v_* - v^{\pi_k} \leq \frac{\gamma^{k-k_0}}{1-\gamma}\left[F_{kk_0} - E'_{kk_0}\right](v_* - v_{k_0})$$

$$v_* - v^{\pi_k} \leq \frac{\gamma^{k-k_0}}{1-\gamma}\left[E_{kk_0} - E'_{kk_0}\right](Tv_{k_0} - v_{k_0}).$$

These bounds imply that λ Policy Iteration converges to the optimal value function at least at a linear rate.

3.2 Performance bounds for Approximate λ Policy Iteration

Let us now consider Approximate λ Policy Iteration. We provide componentwise bounds of the loss $v_* - v^{\pi_k}$ of using policy $\pi_k$ instead of using the optimal policy, with respect to the approximation error $\epsilon_k$, the Policy Bellman residual $T_kv_k - v_k$ and the Bellman residual $Tv_k - v_k = T_{k+1}v_k - v_k$. Recall the subtle difference between these two Bellman residuals: the Policy Bellman residual says how much $v_k$ differs from the value of $\pi_k$ while the Bellman residual says how much $v_k$ differs from the value of the policies $\pi_{k+1}$ and $\pi_*$.

Lemma 3.2 (Componentwise Performance bounds for Approximate λ Policy Iteration).
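The following matrices

$$B_{jk} := \frac{1-\gamma}{\gamma^{k-j}}\left(\frac{\lambda\gamma}{1-\lambda\gamma}\sum_{l=j}^{k-1}\gamma^{k-1-l}\beta^{l-j}(P_*)^{k-1-l}A_{l+1}A_l\cdots A_{j+1} + \beta^{k-j}(I-\gamma P_k)^{-1}A_kA_{k-1}\cdots A_{j+1}\right)$$

$$B'_{jk} := \gamma B_{jk}P_j + (1-\gamma)(P_*)^{k-j}$$

$$C_k := (1-\gamma)^2(I-\gamma P_*)^{-1}P_*(I-\gamma P_k)^{-1}$$

$$C'_k := (1-\gamma)^2(I-\gamma P_*)^{-1}P_{k+1}(I-\gamma P_{k+1})^{-1}$$

$$D := (1-\gamma)P_*(I-\gamma P_*)^{-1}$$

$$D'_k := (1-\gamma)P_k(I-\gamma P_k)^{-1}$$

are stochastic and

$$\forall k_0, \quad \limsup_{k\to\infty}\; v_* - v^{\pi_k} \leq \frac{1}{1-\gamma}\limsup_{k\to\infty}\sum_{j=k_0}^{k-1}\gamma^{k-j}\left[B_{jk}-B'_{jk}\right]\epsilon_j$$

$$\limsup_{k\to\infty}\; v_* - v^{\pi_k} \leq \frac{\gamma}{(1-\gamma)^2}\limsup_{k\to\infty}\left[C_k - C'_k\right](T_kv_k - v_k)$$

$$\forall k, \quad v_* - v^{\pi_k} \leq \frac{\gamma}{1-\gamma}\left[D - D'_k\right](Tv_{k-1}-v_{k-1}).$$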

We can look at the relation between our bound for general λ and the bounds derived by Munos (Lemmas 2.2 and 2.6):

• Let us consider the case where λ = 0. Then β = γ, $A_k = P_k$ and

$$B_{jk} = (1-\gamma)(I-\gamma P_k)^{-1}P_kP_{k-1}\cdots P_{j+1}.$$

Then we have just shown that $\limsup_{k\to\infty} v_* - v^{\pi_k}$ is upper bounded by:

$$\limsup_{k\to\infty}\sum_{j=k_0}^{k-1}\gamma^{k-j}\left[(I-\gamma P_k)^{-1}P_kP_{k-1}\cdots P_{j+1} - \left(\gamma(I-\gamma P_k)^{-1}P_kP_{k-1}\cdots P_j + (P_*)^{k-j}\right)\right]\epsilon_j. \quad (13)$$

The bound derived by Munos for Approximate Value Iteration (Lemma 2.2) is

$$\limsup_{k\to\infty}\sum_{j=0}^{k-1}(I-\gamma P_k)^{-1}\gamma^{k-j}\left[P_kP_{k-1}\cdots P_{j+1} - (P_*)^{k-j}\right]\epsilon_j$$

$$= \limsup_{k\to\infty}\sum_{j=0}^{k-1}\gamma^{k-j}\left[(I-\gamma P_k)^{-1}P_kP_{k-1}\cdots P_{j+1} - (I-\gamma P_k)^{-1}(P_*)^{k-j}\right]\epsilon_j$$

$$= \limsup_{k\to\infty}\sum_{j=0}^{k-1}\gamma^{k-j}\left[(I-\gamma P_k)^{-1}P_kP_{k-1}\cdots P_{j+1} - \left((I-\gamma P_k)^{-1}\gamma P_k(P_*)^{k-j} + (P_*)^{k-j}\right)\right]\epsilon_j. \quad (14)$$

The above bounds are very close to each other: we go from equation (13) to equation (14) by replacing $P_{k-1}\cdots P_j$ by $(P_*)^{k-j}$.

• When λ = 1, β = 0, $A_k = (1-\gamma)(I-\gamma P_k)^{-1}P_k$ and

$$B_{jk} = (1-\gamma)(P_*)^{k-1-j}P_{j+1}(I-\gamma P_{j+1})^{-1}.$$

Our bound is

$$\limsup_{k\to\infty}\; v_* - v^{\pi_k} \leq \limsup_{k\to\infty}\sum_{j=k_0}^{k-1}\gamma^{k-1-j}(P_*)^{k-1-j}u_j$$

with

$$u_j := \left[\gamma P_{j+1}(I-\gamma P_{j+1})^{-1}(I-\gamma P_j) - \gamma P_*\right]\epsilon_j.$$

By definition of the supremum limit, for all $\epsilon > 0$, there exists an index $k_1$ such that for all $j \geq k_1$,

$$u_j \leq \limsup_{l\to\infty} u_l + \epsilon e.$$

Then:

$$\limsup_{k\to\infty}\sum_{j=k_1}^{k-1}\gamma^{k-1-j}(P_*)^{k-1-j}u_j \leq \limsup_{k\to\infty}\sum_{j=k_1}^{k-1}\gamma^{k-1-j}(P_*)^{k-1-j}\left(\limsup_{l\to\infty}u_l + \epsilon e\right) = (I-\gamma P_*)^{-1}\left(\limsup_{l\to\infty}u_l + \epsilon e\right).$$

As this is true for all $\epsilon > 0$, we eventually find the bound of Munos for Approximate Policy Iteration (Lemma 2.6).

Thus, up to some minor details, our componentwise analysis unifies those of Munos. It is actually not surprising that we recover the result of Munos for Approximate Policy Iteration, because the proof we develop in Appendix B is a generalization of Munos's proof for Approximate Policy Iteration in [3]. Similarly, the reason why we do not exactly fall back on the componentwise bound of Munos for Approximate Value Iteration is that our proof uses a technique that is slightly different from that of Munos in [5]. It would in fact be possible to write a proof that generalizes that of Approximate Value Iteration, but this is not really fundamental as most of the results we are going to deduce would not be affected.

The bound with respect to the approximation error can be improved if we know or observe that the value or the policy converges. Note that the former condition implies the latter.



Corollary 3.3. Suppose the value converges to some $v$. Write π its greedy policy and $P$ the corresponding stochastic matrix. Consider the following stochastic matrices:

$$B_v := (1-\gamma)\left[(1-\lambda)(I-\gamma P)^{-1}P + \lambda(I-\gamma P_*)^{-1}P\right]$$

$$D := (1-\gamma)P_*(I-\gamma P_*)^{-1}$$

Then the error necessarily converges to some $\epsilon$ and

$$v_* - v^{\pi} \leq \frac{\gamma}{1-\gamma}\left[B_v - D\right]\epsilon.$$

Corollary 3.4. Suppose the policy converges to some π. Write $P$ the corresponding stochastic matrix. The
following matrices



A* 

A jk 



B 3k 



(1-A 7 ) J P(/-A 7 P)- 1 



1-7 

syk j 

A* jk P 

1-7 
1 - A 7 



i — A7 * — ' 



-1/ ylirxk-l-j 



A 7 



(P*) fc - + (1 ~ A 1 7 ^ 1 ~ A) A%- fc (J - A 7 P)~ W 



are stochastic and 



fe-i 



- W * < limsup 1-^ 2 7 fc ^ [PJ fc - Pf fe ] ej 



j=k 



4 Span seminorm Performance bounds for λ Policy Iteration

Following the lines of Munos [3, 4], we here show that the componentwise bounds we have derived in the previous section enable us to derive span seminorm bounds. We first describe some lemmas that enable us to make this derivation. Then we present the results for λ Policy Iteration.

4.1 From Componentwise bounds to span seminorm bounds

We here present four lemmas that show how to derive span seminorm bounds from componentwise bounds. The proofs are close to those of [3, 4] and deferred to Appendix E for clarity. The main difference is that we come up with span seminorm bounds instead of $L_p$/max-norm bounds.
The first two lemmas are:

Lemma 4.1. Let $x_k$, $y_k$ be sequences of vectors and $X_k$, $X'_k$ sequences of stochastic matrices satisfying

$$\limsup_{k\to\infty}|x_k| \leq K\limsup_{k\to\infty}(X_k - X'_k)y_k$$

where $|x|$ denotes the componentwise absolute value of $x$. For every distribution μ, $\mu_k := \frac{1}{2}\mu(X_k + X'_k)$ is a distribution and

$$\limsup_{k\to\infty}\|x_k\|_{p,\mu} \leq K\limsup_{k\to\infty}\mathrm{span}_{p,\mu_k}[y_k]$$

$$\limsup_{k\to\infty}\|x_k\|_{\infty} \leq K\limsup_{k\to\infty}\mathrm{span}_{\infty}[y_k].$$

Lemma 4.2. Let $x_k$, $y_k$ be sequences of vectors and $X_{kj}$, $X'_{kj}$ sequences of stochastic matrices satisfying

$$\forall k_0, \quad \limsup_{k\to\infty}|x_k| \leq K\limsup_{k\to\infty}\sum_{j=k_0}^{k-1}\gamma^{k-j}(X_{kj} - X'_{kj})y_j.$$

For every distribution μ,

$$\mu_{kj} := \frac{1}{2}\mu(X_{kj} + X'_{kj})$$

are distributions and

$$\limsup_{k\to\infty}\|x_k\|_{p,\mu} \leq \frac{K\gamma}{1-\gamma}\lim_{k_0\to\infty}\sup_{k\geq j\geq k_0}\mathrm{span}_{p,\mu_{kj}}[y_j]$$

$$\limsup_{k\to\infty}\|x_k\|_{\infty} \leq \frac{K\gamma}{1-\gamma}\limsup_{k\to\infty}\mathrm{span}_{\infty}[y_k].$$

The two other lemmas consider the concentration coefficient introduced by Munos in [4, 5] and already mentioned before (equation (6)). We recall its definition. We assume there exists a distribution ν and a real number $C(\nu)$ such that

$$C(\nu) := \max_{i,j,a}\frac{P_a(j|i)}{\nu(j)}.$$

If $X$ is an average of products of stochastic matrices of the MDP, it can be seen that for any non-negative vector $y$, $\mu Xy \leq C(\nu)\nu y$. Thus, with the notation of Lemma 4.1, we have for all $k$ and all vectors $y$,

$$\left(\mathrm{span}_{p,\mu_k}[y]\right)^p = \min_a\left(\|y - ae\|_{p,\mu_k}\right)^p = \min_a \mu_k|y - ae|^p = \min_a \frac{1}{2}\mu(X_k + X'_k)|y - ae|^p$$

$$\leq C(\nu)\min_a \nu|y - ae|^p = C(\nu)\min_a\left(\|y - ae\|_{p,\nu}\right)^p = C(\nu)\left(\mathrm{span}_{p,\nu}[y]\right)^p.$$

This leads to an analogue of Lemma 4.1 involving this concentration coefficient:

Lemma 4.3. Let $x_k$, $y_k$ be sequences of vectors and $X_k$, $X'_k$ sequences of stochastic matrices (that are averages of products of stochastic matrices of the MDP) satisfying

$$\limsup_{k\to\infty}|x_k| \leq K\limsup_{k\to\infty}(X_k - X'_k)y_k.$$

Then

$$\limsup_{k\to\infty}\|x_k\|_{\infty} \leq K\,[C(\nu)]^{1/p}\limsup_{k\to\infty}\mathrm{span}_{p,\nu}[y_k].$$

Similarly, we get the following analogue of Lemma 4.2:

Lemma 4.4. Let $x_k$, $y_k$ be sequences of vectors and $X_{kj}$, $X'_{kj}$ sequences of stochastic matrices (that are averages of products of stochastic matrices of the MDP) satisfying

$$\forall k_0, \quad \limsup_{k\to\infty}|x_k| \leq K\limsup_{k\to\infty}\sum_{j=k_0}^{k-1}\gamma^{k-j}(X_{kj} - X'_{kj})y_j.$$

Then

$$\limsup_{k\to\infty}\|x_k\|_{\infty} \leq \frac{K\gamma}{1-\gamma}\,[C(\nu)]^{1/p}\limsup_{k\to\infty}\mathrm{span}_{p,\nu}[y_k].$$
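To make the concentration coefficient concrete, the following Python sketch computes $C(\nu)$ for a small MDP and checks the inequality $\mu Xy \leq C(\nu)\nu y$ on one product of transition matrices. It assumes the definition $C(\nu) = \max_{i,j,a} P_a(j|i)/\nu(j)$; the transition matrices, the distributions and the helper names are hypothetical.

```python
def concentration(transitions, nu):
    """C(nu) = max over actions a and states i, j of P_a(j | i) / nu(j)."""
    n = len(nu)
    return max(
        matrix[i][j] / nu[j]
        for matrix in transitions
        for i in range(n)
        for j in range(n)
    )


def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def vec_mat(mu, x):
    """Row vector times matrix."""
    n = len(mu)
    return [sum(mu[i] * x[i][j] for i in range(n)) for j in range(n)]


def dot(u, w):
    """Inner product of two vectors."""
    return sum(p * q for p, q in zip(u, w))


P_STAY = [[1.0, 0.0], [0.0, 1.0]]
P_CHANGE = [[0.0, 1.0], [1.0, 0.0]]
nu = [0.25, 0.75]
c_nu = concentration([P_STAY, P_CHANGE], nu)

x = matmul(P_CHANGE, P_STAY)      # a product of stochastic matrices of the MDP
mu = [1.0, 0.0]                   # an arbitrary starting distribution
y = [3.0, 1.0]                    # an arbitrary non-negative vector
lhs = dot(vec_mat(mu, x), y)
rhs = c_nu * dot(nu, y)
print(c_nu, lhs, rhs)
```

Note that $C(\nu)$ becomes large as soon as ν puts little mass on a state that some action reaches with high probability, which is why the uniform distribution is often a natural choice in small examples.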



We are now ready to give some span seminorm bounds for λ Policy Iteration.

4.2 Rates of convergence for Exact λ Policy Iteration

Consider Exact λ Policy Iteration. Using Lemma 4.1, Lemma 3.1 becomes:

Proposition 4.5 (Rates of convergence for Exact λ Policy Iteration in span seminorm).
With the notations of Lemma 3.1, for every distribution μ and all indices $k \geq k_0$,

$$\|v_* - v^{\pi_k}\|_{p,\mu} \leq \frac{\gamma^{k-k_0}}{1-\gamma}\,\mathrm{span}_{p,\mu\frac{1}{2}(F_{kk_0}+E'_{kk_0})}[v_* - v_{k_0}]$$

$$\|v_* - v^{\pi_k}\|_{\infty} \leq \frac{\gamma^{k-k_0}}{1-\gamma}\,\mathrm{span}_{\infty}[v_* - v_{k_0}]$$

$$\|v_* - v^{\pi_k}\|_{p,\mu} \leq \frac{\gamma^{k-k_0}}{1-\gamma}\,\mathrm{span}_{p,\mu\frac{1}{2}(E_{kk_0}+E'_{kk_0})}[Tv_{k_0} - v_{k_0}]$$

$$\|v_* - v^{\pi_k}\|_{\infty} \leq \frac{\gamma^{k-k_0}}{1-\gamma}\,\mathrm{span}_{\infty}[Tv_{k_0} - v_{k_0}]$$

$$\|v_* - v^{\pi_k}\|_{p,\mu} \leq \gamma^{k-k_0}\left(\left\|(v_* - v_{k_0}) - \min_s[v_*(s) - v_{k_0}(s)]\,e\right\|_{p,\mu(P_*)^{k-k_0}} + \|v_* - v^{\pi_{k_0+1}}\|_{\infty}\right)$$

$$\|v_* - v^{\pi_k}\|_{\infty} \leq \gamma^{k-k_0}\left(\mathrm{span}_{\infty}[v_* - v_{k_0}] + \|v_* - v^{\pi_{k_0+1}}\|_{\infty}\right) \quad (15)$$

The first pair of rates is expressed in terms of the distance between the value function and the optimal value function at some iteration $k_0$. The second pair of inequalities can be used as a stopping criterion. Indeed, taking $k = k_0 + 1$, it implies the following stopping condition, which generalizes that of Proposition 2.1:

Proposition 4.6 (Stopping condition for Exact λ Policy Iteration).
If at some iteration $k_0$, the value $v_{k_0}$ satisfies:

$$\mathrm{span}_{\infty}[Tv_{k_0} - v_{k_0}] \leq \frac{1-\gamma}{\gamma}\,\epsilon$$

then the greedy policy $\pi_{k_0+1}$ with respect to $v_{k_0}$ is ε-optimal: $\|v_* - v^{\pi_{k_0+1}}\|_{\infty} \leq \epsilon$.

The last pair of inequalities relies on the distance between the value function and the optimal value function and on the value difference between the optimal policy and the first greedy policy; compared to the others, it has the advantage of not explicitly containing a $\frac{1}{1-\gamma}$ factor (though this factor is hidden in $v_* - v^{\pi_{k_0+1}}$). To the best of our knowledge, this bound is new even for the specific cases of Value Iteration and Policy Iteration.
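The stopping condition of Proposition 4.6 can be sketched in Python for the special case of Value Iteration (λ = 0). The example assumes that $\mathrm{span}_{\infty}[u] = \max_s u(s) - \min_s u(s)$, as in Puterman's stopping criterion; the two-state MDP and all function names are hypothetical.

```python
GAMMA = 0.9
REWARD = [0.0, 1.0]            # hypothetical two-state MDP, reward per state
STAY, CHANGE = 0, 1


def q_values(v, s):
    """Values of the actions (STAY, CHANGE) in state s."""
    return [REWARD[s] + GAMMA * v[s], REWARD[s] + GAMMA * v[1 - s]]


def bellman_optimal(v):
    """Apply the Bellman optimality operator T."""
    return [max(q_values(v, s)) for s in range(2)]


def greedy(v):
    """Greedy policy with respect to v (ties broken in favour of STAY)."""
    policy = []
    for s in range(2):
        q = q_values(v, s)
        policy.append(q.index(max(q)))
    return policy


def policy_value(policy, sweeps=2000):
    """Evaluate a policy by iterating its Bellman operator."""
    v = [0.0, 0.0]
    for _ in range(sweeps):
        v = [REWARD[s] + GAMMA * v[s if policy[s] == STAY else 1 - s] for s in range(2)]
    return v


def value_iteration_with_span_stop(eps):
    """Iterate v <- T v until span(T v - v) <= (1 - gamma) / gamma * eps."""
    v, iterations = [0.0, 0.0], 0
    while True:
        tv = bellman_optimal(v)
        residual = [a - b for a, b in zip(tv, v)]
        if max(residual) - min(residual) <= (1 - GAMMA) / GAMMA * eps:
            return greedy(v), iterations
        v, iterations = tv, iterations + 1


policy, iterations = value_iteration_with_span_stop(eps=0.5)
v_star = [GAMMA / (1 - GAMMA), 1 / (1 - GAMMA)]   # known optimum for this MDP
loss = max(abs(a - b) for a, b in zip(v_star, policy_value(policy)))
print(policy, iterations, loss)
```

In this example the residual $Tv - v$ is still far from zero in max-norm when the test succeeds, but it is almost constant across states, so its span is tiny and the greedy policy is already optimal.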

Using Lemma 4.3, Lemma 3.1 gives the following analysis with respect to the concentration coefficient.

Proposition 4.7 (Rates of convergence for Exact λ Policy Iteration in span seminorm).
Let $C(\nu)$ be the concentration coefficient defined in equation (6). For all $p$ and $k \geq k_0$,

$$\|v_* - v^{\pi_k}\|_{\infty} \leq \frac{\gamma^{k-k_0}}{1-\gamma}\,[C(\nu)]^{1/p}\,\mathrm{span}_{p,\nu}[v_* - v_{k_0}]$$

$$\|v_* - v^{\pi_k}\|_{\infty} \leq \frac{\gamma^{k-k_0}}{1-\gamma}\,[C(\nu)]^{1/p}\,\mathrm{span}_{p,\nu}[Tv_{k_0} - v_{k_0}]$$

Also: 

IK - «** IL < 7 fc - fco [C(^] 1/P (V0Hj„m [«♦ ~ «fco] + IK(«) - ^ 0+I ID 



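4.3 Bounds for Approximate λ Policy Iteration

We can derive the same kind of bounds for Approximate λ Policy Iteration. Using Lemmas 4.1 and 4.2, Lemma 3.2 gives the following proposition.

Proposition 4.8 (Performance bounds for Approximate λ Policy Iteration in span seminorm).
With the notations of Lemma 3.2, for all $p$ and every distribution μ,

$$\limsup_{k\to\infty}\|v_* - v^{\pi_k}\|_{p,\mu} \leq \frac{\gamma}{(1-\gamma)^2}\lim_{k_0\to\infty}\sup_{k\geq j\geq k_0}\mathrm{span}_{p,\mu\frac{1}{2}(B_{jk}+B'_{jk})}[\epsilon_j]$$

$$\limsup_{k\to\infty}\|v_* - v^{\pi_k}\|_{\infty} \leq \frac{\gamma}{(1-\gamma)^2}\limsup_{j\to\infty}\mathrm{span}_{\infty}[\epsilon_j]$$

$$\limsup_{k\to\infty}\|v_* - v^{\pi_k}\|_{p,\mu} \leq \frac{\gamma}{(1-\gamma)^2}\limsup_{k\to\infty}\mathrm{span}_{p,\mu\frac{1}{2}(C_k+C'_k)}[T_kv_k - v_k]$$

$$\limsup_{k\to\infty}\|v_* - v^{\pi_k}\|_{\infty} \leq \frac{\gamma}{(1-\gamma)^2}\limsup_{k\to\infty}\mathrm{span}_{\infty}[T_kv_k - v_k]$$

$$\forall k, \quad \|v_* - v^{\pi_k}\|_{p,\mu} \leq \frac{\gamma}{1-\gamma}\,\mathrm{span}_{p,\mu\frac{1}{2}(D+D'_k)}[Tv_{k-1} - v_{k-1}]$$

$$\forall k, \quad \|v_* - v^{\pi_k}\|_{\infty} \leq \frac{\gamma}{1-\gamma}\,\mathrm{span}_{\infty}[Tv_{k-1} - v_{k-1}]$$

Similarly, Corollaries 3.3 and 3.4 give: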

Corollary 4.9 (Performance bounds for Approximate λ Policy Iteration in case of convergence).
If the value converges to some $v$, then the approximation error converges to some $\epsilon$. With the notations of Corollaries 3.3 and 3.4, the corresponding greedy policy π satisfies for all $p$

$$\|v_* - v^{\pi}\|_{p,\mu} \leq \frac{\gamma}{1-\gamma}\,\mathrm{span}_{p,\mu\frac{1}{2}(B_v+D)}[\epsilon]$$

$$\|v_* - v^{\pi}\|_{\infty} \leq \frac{\gamma}{1-\gamma}\,\mathrm{span}_{\infty}[\epsilon]$$

If the policy converges to some π, then:

$$\|v_* - v^{\pi}\|_{p,\mu} \leq \frac{\gamma(1-\lambda\gamma)}{(1-\gamma)^2}\lim_{k_0\to\infty}\sup_{k\geq j\geq k_0}\mathrm{span}_{p,\mu\frac{1}{2}(B^{\pi}_{jk}+B'^{\pi}_{jk})}[\epsilon_j]$$

$$\|v_* - v^{\pi}\|_{\infty} \leq \frac{\gamma(1-\lambda\gamma)}{(1-\gamma)^2}\limsup_{j\to\infty}\mathrm{span}_{\infty}[\epsilon_j]$$

Eventually, using Lemmas 4.3 and 4.4, we have the following bounds involving the concentration coefficient.

Proposition 4.10 (Performance bounds for Approximate λ Policy Iteration in span seminorm).
Let $C(\nu)$ be the concentration coefficient defined in equation (6). For all $p$ and all $k$,

$$\limsup_{k\to\infty}\|v_* - v^{\pi_k}\|_{\infty} \leq \frac{\gamma}{(1-\gamma)^2}\,[C(\nu)]^{1/p}\limsup_{j\to\infty}\mathrm{span}_{p,\nu}[\epsilon_j]$$

$$\limsup_{k\to\infty}\|v_* - v^{\pi_k}\|_{\infty} \leq \frac{\gamma}{(1-\gamma)^2}\,[C(\nu)]^{1/p}\limsup_{k\to\infty}\mathrm{span}_{p,\nu}[T_kv_k - v_k]$$

$$\forall k, \quad \|v_* - v^{\pi_k}\|_{\infty} \leq \frac{\gamma}{1-\gamma}\,[C(\nu)]^{1/p}\,\mathrm{span}_{p,\nu}[Tv_{k-1} - v_{k-1}]$$

Corollary 4.11 (Performance bounds for Approximate λ Policy Iteration in case of convergence).
If the value converges to some $v$, then the approximation error converges to some $\epsilon$, and the corresponding greedy policy π satisfies:

$$\|v_* - v^{\pi}\|_{\infty} \leq \frac{\gamma}{1-\gamma}\,[C(\nu)]^{1/p}\,\mathrm{span}_{p,\nu}[\epsilon]$$

If the policy converges to some π, then:

$$\|v_* - v^{\pi}\|_{\infty} \leq \frac{\gamma(1-\lambda\gamma)}{(1-\gamma)^2}\,[C(\nu)]^{1/p}\limsup_{j\to\infty}\mathrm{span}_{p,\nu}[\epsilon_j]$$

When comparing the specific bounds of Munos for Approximate Value Iteration (Propositions 2.3 and 2.4) and Approximate Policy Iteration (Propositions 2.7 and 2.8), we wrote that the latter had the nice property that the bounds only depend on asymptotic errors/residuals (while the former depend on all errors). Our bounds for λ Policy Iteration have this nice property too. Considering the relations between the span seminorms and the other standard norms (equation (1)), we see that our results are more general than those of Munos.



Conclusion and Future work

We have considered the λ Policy Iteration algorithm, which generalizes the standard algorithms Value Iteration and Policy Iteration. We have reviewed some results by Puterman [7], Bertsekas and Tsitsiklis [2] and Munos [4, 5] concerning the rate of convergence of these standard algorithms in the exact and approximate cases. We have extended these results to λ Policy Iteration and proposed a new convergence rate bound in the exact case (equation (15)). The performance analysis we have described slightly improves the previous results in two ways. 1) As suggested by the results of Puterman in the exact case, the use of the span seminorm has enabled us to derive tighter bounds in general (see footnote 7). 2) Our analysis of Approximate λ Policy Iteration relates the asymptotic performance of the algorithm to the asymptotic errors/residuals instead of a uniform bound on the errors/residuals, and this might be of practical interest. More importantly, the main contribution of this paper has been to provide a unified view of the approximate versions of optimal control and reinforcement learning algorithms and to show that it is sound to use them.

We now describe some future research directions. Munos introduced in [5] some concentration coefficients that are finer than the one we used throughout the paper, and a natural track would be to consider them in order to derive finer performance bounds. From what we have achieved here, this is not trivial since the componentwise analysis we derived for λ Policy Iteration is significantly more intricate than the ones we find in the specific limit cases λ = 0 (Value Iteration) and λ = 1 (Policy Iteration). Analogously to a recent work by Munos and Szepesvari [6] for Approximate Value Iteration, a potential use of these coefficients would be to derive an analysis of Approximate λ Policy Iteration involving PAC-style polynomial bounds on the number of samples and a quantitative measure of the power of the approximation architecture (like the well-known VC dimension).

7 If the approximation of λ Policy Iteration is only due to the use of a Least-square Feature Based Linear Architecture, and if the constant $e$ belongs to the set of features, then the span seminorm bound is equivalent to the standard $L_p$ norm bound.

Another important direction is to study the implications of the choice of the parameter λ, as is done for instance by Singh and Dayan in [8]. The original analysis by Bertsekas and Ioffe [1] shows how one can concretely implement Exact λ Policy Iteration. Each iteration requires the computation of the fixed point of the contracting operator $M_k$, whose contraction factor λγ grows with λ. We plan to study the tradeoff between the ease of computing this fixed point (the smaller λ, the faster) and the time for λ Policy Iteration to converge to the optimal policy (the bigger λ, the faster). In parallel, the reader might have noticed that most of the bounds we have provided do not depend on λ. We believe that the finer concentration coefficients of Munos we have just discussed should also help keep track of the influence of λ on the performance of the exact or approximate algorithm. We expect that we should be able to derive results relating the smoothness of the MDP problem to the choice of λ.

Appendix

The following appendices contain all the proofs concerning the analysis of λ Policy Iteration. We write $P_k = P^{\pi_k}$ for the stochastic matrix corresponding to the policy $\pi_k$ which is greedy with respect to $v_{k-1}$, and $P_*$ for the stochastic matrix corresponding to the optimal policy $\pi_*$. Similarly we write $T_k$ and $T_*$ for the associated Bellman operators.

The proof techniques we have developed are strongly inspired by those of Munos in the articles [4, 5]. Most of the inequalities will follow from the definition of the greedy operator:

$$\pi = \mathrm{greedy}(v) \iff \forall \pi', \; T^{\pi'}v \leq T^{\pi}v.$$

We will often use the property that an average of stochastic matrices is also a stochastic matrix. A recurrent instance of this property is: if $P$ is some stochastic matrix, then the geometric average

$$(1-\alpha)\sum_{i=0}^{\infty}(\alpha P)^i = (1-\alpha)(I-\alpha P)^{-1}$$

with $0 \leq \alpha < 1$ is also a stochastic matrix. Finally, we will use the property that if some vectors $x$ and $y$ are such that $x \leq y$, then $Px \leq Py$ for any stochastic matrix $P$.
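Both properties are easy to check numerically. The Python sketch below approximates the geometric average by a truncated series and verifies that the result is (numerically) stochastic and monotone; the matrix `P`, the value of `alpha` and the truncation length are arbitrary choices made for this illustration.

```python
def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def matvec(m, x):
    """Matrix times column vector."""
    return [sum(m[i][j] * x[j] for j in range(len(x))) for i in range(len(m))]


def geometric_average(p, alpha, terms=500):
    """Truncated series (1 - alpha) * sum_{i < terms} (alpha * p)^i."""
    n = len(p)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # (alpha p)^0
    total = [[0.0] * n for _ in range(n)]
    for _ in range(terms):
        total = [[total[i][j] + power[i][j] for j in range(n)] for i in range(n)]
        power = [[alpha * entry for entry in row] for row in matmul(power, p)]
    return [[(1 - alpha) * entry for entry in row] for row in total]


P = [[0.5, 0.5], [0.2, 0.8]]
G = geometric_average(P, alpha=0.9)
row_sums = [sum(row) for row in G]

x, y = [1.0, 2.0], [1.5, 2.0]          # x <= y componentwise
gx, gy = matvec(G, x), matvec(G, y)
print(row_sums, gx, gy)
```

The truncation error is of order `alpha ** terms`, which is negligible here.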



A Componentwise bounds with respect to the Bellman residuals

In this Appendix, we study the loss

$$l_k := v_* - v^{\pi_k}$$

with respect to the two following Bellman residuals:

$$b'_k := T_kv_k - v_k$$

$$b_k := T_{k+1}v_k - v_k = Tv_k - v_k$$

$b'_k$ says how much $v_k$ differs from the value of $\pi_k$ while $b_k$ says how much $v_k$ differs from the value of the policies $\pi_{k+1}$ and $\pi_*$.

A.1 Bounds with respect to Policy Bellman residual

Our analysis relies on the following lemma.

Lemma A.1. Suppose we have a policy π and a function $v$ that is an approximation of the value $v^{\pi}$ of π in the sense that its residual $b' := T^{\pi}v - v$ is small. Taking the greedy policy π' with respect to $v$ will reduce the loss as follows:

$$v_* - v^{\pi'} \leq \gamma P_*(v_* - v^{\pi}) + \left(\gamma P_*(I-\gamma P)^{-1} - \gamma P'(I-\gamma P')^{-1}\right)b'$$

where $P$ and $P'$ are the stochastic matrices which correspond to π and π'.

Proof. We have:

$$v_* - v^{\pi'} = T_*v_* - T^{\pi'}v^{\pi'}$$

$$= T_*v_* - T_*v^{\pi} + T_*v^{\pi} - T_*v + T_*v - T^{\pi'}v + T^{\pi'}v - T^{\pi'}v^{\pi'}$$

$$\leq \gamma P_*(v_* - v^{\pi}) + \gamma P_*(v^{\pi} - v) + \gamma P'(v - v^{\pi'}) \quad (16)$$

where we used the fact that $T_*v \leq T^{\pi'}v$. One can see that:

$$v^{\pi} - v = T^{\pi}v^{\pi} - v = T^{\pi}v^{\pi} - T^{\pi}v + T^{\pi}v - v = \gamma P(v^{\pi} - v) + b' = (I-\gamma P)^{-1}b' \quad (17)$$

and that

$$v - v^{\pi'} = v - T^{\pi'}v^{\pi'} = v - T^{\pi}v + T^{\pi}v - T^{\pi'}v + T^{\pi'}v - T^{\pi'}v^{\pi'} \leq -b' + \gamma P'(v - v^{\pi'}) \leq (I-\gamma P')^{-1}(-b') \quad (18)$$

where we used the fact that $T^{\pi}v \leq T^{\pi'}v$. We get the result by putting back equations (17) and (18) into equation (16). □

To derive a bound for λ Policy Iteration, we simply apply the above lemma to $\pi = \pi_k$, $v = v_k$ and $\pi' = \pi_{k+1}$. We thus get:

$$l_{k+1} \leq \gamma P_*l_k + \left(\gamma P_*(I-\gamma P_k)^{-1} - \gamma P_{k+1}(I-\gamma P_{k+1})^{-1}\right)b'_k.$$

Introduce the following stochastic matrices:

$$C_k := (1-\gamma)^2(I-\gamma P_*)^{-1}\left(P_*(I-\gamma P_k)^{-1}\right)$$

$$C'_k := (1-\gamma)^2(I-\gamma P_*)^{-1}\left(P_{k+1}(I-\gamma P_{k+1})^{-1}\right)$$

This leads to the following componentwise bound:

$$\limsup_{k\to\infty} l_k \leq \frac{\gamma}{(1-\gamma)^2}\limsup_{k\to\infty}\left[C_k - C'_k\right]b'_k.$$
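A.2 Bounds with respect to Bellman residual

We rely on the following lemma (which is for instance proved by Munos in [5]).

Lemma A.2. Suppose we have a function $v$. Let π be the greedy policy with respect to $v$. Then

$$v_* - v^{\pi} \leq \gamma\left[P_*(I-\gamma P_*)^{-1} - P^{\pi}(I-\gamma P^{\pi})^{-1}\right](T^{\pi}v - v).$$

We provide a proof for the sake of completeness:

Proof. Using the fact that $T_*v \leq T^{\pi}v$, we see that

$$v_* - v^{\pi} = T_*v_* - T^{\pi}v^{\pi}$$

$$\leq T_*v_* - T_*v + T^{\pi}v - T^{\pi}v^{\pi}$$

$$= \gamma P_*(v_* - v) + \gamma P^{\pi}(v - v^{\pi})$$

$$= \gamma P_*(v_* - v^{\pi}) + \gamma P_*(v^{\pi} - v) + \gamma P^{\pi}(v - v^{\pi}),$$

which, after moving $\gamma P_*(v_* - v^{\pi})$ to the left-hand side and multiplying by the non-negative matrix $(I-\gamma P_*)^{-1}$, gives

$$v_* - v^{\pi} \leq (I-\gamma P_*)^{-1}(\gamma P_* - \gamma P^{\pi})(v^{\pi} - v).$$

Using equation (17) we see that:

$$v^{\pi} - v = (I-\gamma P^{\pi})^{-1}(T^{\pi}v - v).$$

Thus

$$v_* - v^{\pi} \leq (I-\gamma P_*)^{-1}(\gamma P_* - \gamma P^{\pi})(I-\gamma P^{\pi})^{-1}(T^{\pi}v - v)$$

$$= (I-\gamma P_*)^{-1}(\gamma P_* - I + I - \gamma P^{\pi})(I-\gamma P^{\pi})^{-1}(T^{\pi}v - v)$$

$$= \left[(I-\gamma P_*)^{-1} - (I-\gamma P^{\pi})^{-1}\right](T^{\pi}v - v)$$

$$= \gamma\left[P_*(I-\gamma P_*)^{-1} - P^{\pi}(I-\gamma P^{\pi})^{-1}\right](T^{\pi}v - v). \quad □$$

To derive a bound for λ Policy Iteration, we simply apply the above lemma to $v = v_{k-1}$ and $\pi = \pi_k$. We thus get:

$$l_k \leq \frac{\gamma}{1-\gamma}\left[D - D'_k\right]b_{k-1} \quad (19)$$

where

$$D := (1-\gamma)P_*(I-\gamma P_*)^{-1}$$

$$D'_k := (1-\gamma)P_k(I-\gamma P_k)^{-1}$$

are stochastic matrices.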
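The bound of Lemma A.2 can be checked numerically on a small example. The Python sketch below uses a hypothetical two-state deterministic MDP (rewards 0 and 1 attached to the states, actions stay/change) in which the optimal policy always moves to the second state; the helper names are illustrative only.

```python
GAMMA = 0.9
REWARD = [0.0, 1.0]


def inv2(m):
    """Inverse of a 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def matvec(m, x):
    return [sum(m[i][j] * x[j] for j in range(2)) for i in range(2)]


def transition(nxt):
    """Deterministic transition matrix: state i moves to state nxt[i]."""
    return [[1.0 if nxt[i] == j else 0.0 for j in range(2)] for i in range(2)]


def resolvent(p):
    """(I - gamma P)^{-1}."""
    return inv2([[(1.0 if i == j else 0.0) - GAMMA * p[i][j] for j in range(2)]
                 for i in range(2)])


def lemma_a2_sides(v):
    """Return (v_* - v^pi, right-hand side of Lemma A.2) for pi greedy w.r.t. v."""
    nxt = [max((s, 1 - s), key=lambda t: v[t]) for s in range(2)]   # greedy next states
    p, p_star = transition(nxt), transition([1, 1])
    residual = [REWARD[s] + GAMMA * v[nxt[s]] - v[s] for s in range(2)]  # T v - v
    v_pi = matvec(resolvent(p), REWARD)
    v_star = matvec(resolvent(p_star), REWARD)
    lhs = [a - b for a, b in zip(v_star, v_pi)]
    d_star = matmul(p_star, resolvent(p_star))
    d_pi = matmul(p, resolvent(p))
    diff = [[d_star[i][j] - d_pi[i][j] for j in range(2)] for i in range(2)]
    rhs = [GAMMA * x for x in matvec(diff, residual)]
    return lhs, rhs


for v in ([1.0, 0.0], [0.0, 0.0], [3.0, -2.0], [5.0, 20.0]):
    print(v, *lemma_a2_sides(v))
```

For `v = [1.0, 0.0]` the loss is (9, 9) while the bound evaluates to (18, 18); for `v = [0.0, 0.0]` the bound is attained.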



B General componentwise analysis of λ Policy Iteration

This Appendix contains the core of all the remaining results. We show how to compute an upper bound of the loss for (approximate) λ Policy Iteration in general. It will be the basis for the derivation of componentwise bounds for Approximate λ Policy Iteration (Section 3.2) and Exact λ Policy Iteration (Section 3.1).

B.1 Overview of the analysis of the componentwise loss bound

We define:

• the loss of using policy $\pi_k$ instead of the optimal policy:

$$l_k := v_* - v^{\pi_k}$$

• the value of the $k$-th iterate b.a. (before approximation):

$$w_k := v_k - \epsilon_k$$

• the distance between the optimal value and the $k$-th value b.a.:

$$d_k := v_* - w_k$$

• the shift between the $k$-th value b.a. and the value of the $k$-th policy:

$$s_k := w_k - v^{\pi_k}$$

• the Bellman residual of the $k$-th value:

$$b_k := T_{k+1}v_k - v_k = Tv_k - v_k$$

All our results come from a series of relations involving the above quantities. In this Appendix, we will use the notation $\overline{x}$ for an upper bound of $x$ and $\underline{x}$ for a lower bound.
Let us define the following stochastic matrix:

$$A_k := (1-\lambda\gamma)P_k(I-\lambda\gamma P_k)^{-1}.$$

Also define the following factor:

$$\beta := \frac{(1-\lambda)\gamma}{1-\lambda\gamma}.$$

Then

Lemma B.1. The shift is related to the Bellman residual:

$$s_k = \beta(I-\gamma P_k)^{-1}A_k(-b_{k-1}).$$

Lemma B.2. The Bellman residual at iteration $k+1$ cannot be much lower than the Bellman residual at iteration $k$:

$$b_{k+1} \geq \beta A_{k+1}b_k + x_{k+1}$$

where $x_k := (\gamma P_k - I)\epsilon_k$ only depends on the approximation error.

As a consequence, a lower bound of the Bellman residual is:

$$b_k \geq \sum_{j=k_0+1}^{k}\beta^{k-j}(A_kA_{k-1}\cdots A_{j+1})x_j + \beta^{k-k_0}(A_kA_{k-1}\cdots A_{k_0+1})b_{k_0} =: \underline{b}_k.$$

Using Lemma B.1, the bound on the Bellman residual also provides an upper bound on the shift:

$$s_k \leq \beta(I-\gamma P_k)^{-1}A_k(-\underline{b}_{k-1}) =: \overline{s}_k.$$

Lemma B.3. The distance tends to reduce:

$$d_{k+1} \leq \gamma P_*d_k + y_k$$

where $y_k := \frac{\lambda\gamma}{1-\lambda\gamma}A_{k+1}(-\underline{b}_k) - \gamma P_*\epsilon_k$ depends on the lower bound of the Bellman residual and the approximation error.

Then, an upper bound of the distance is:

$$d_k \leq \sum_{j=k_0}^{k-1}\gamma^{k-1-j}(P_*)^{k-1-j}y_j + \gamma^{k-k_0}(P_*)^{k-k_0}d_{k_0} =: \overline{d}_k.$$

Eventually, as

$$l_k = d_k + s_k \leq \overline{d}_k + \overline{s}_k,$$

the upper bounds on the distance and the shift will enable us to derive the upper bound on the loss.

The above proof is a generalization of that of Munos in [4] for Approximate Policy Iteration. When λ = 1, that is when both proofs coincide, Lemmas B.1 and B.2 have a particularly trivial form since β = 0.

B.2 Proof of Lemma B.1: a relation between the shift and the Bellman residual

We have:

$$(I-\gamma P_k)s_k = (I-\gamma P_k)(w_k - v^{\pi_k})$$

$$= (I-\gamma P_k)w_k - r_k$$

$$= (I - \lambda\gamma P_k + \lambda\gamma P_k - \gamma P_k)w_k - r_k$$

$$= (I-\lambda\gamma P_k)w_k + (\lambda\gamma P_k - \gamma P_k)w_k - r_k$$

$$= r_k + (1-\lambda)\gamma P_kv_{k-1} + (\lambda-1)\gamma P_kw_k - r_k$$

$$= (1-\lambda)\gamma P_k(v_{k-1} - w_k)$$

$$= (1-\lambda)\gamma P_k(I-\lambda\gamma P_k)^{-1}(v_{k-1} - T_kv_{k-1})$$

$$= (1-\lambda)\gamma P_k(I-\lambda\gamma P_k)^{-1}(-b_{k-1}).$$

Therefore

$$s_k = \beta(I-\gamma P_k)^{-1}A_k(-b_{k-1})$$

with

$$A_k := (1-\lambda\gamma)P_k(I-\lambda\gamma P_k)^{-1}.$$

Suppose we have a lower bound of the Bellman residual: $b_k \geq \underline{b}_k$ (we will derive one soon). Since $(I-\gamma P_k)^{-1}A_k$ only has non-negative elements, then

$$s_k \leq \beta(I-\gamma P_k)^{-1}A_k(-\underline{b}_{k-1}) =: \overline{s}_k.$$
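The identity of Lemma B.1 can be verified numerically in the exact case ($\epsilon_k = 0$). The Python sketch below does so on a hypothetical two-state deterministic MDP (rewards 0 and 1 attached to the states, actions stay/change), computing $w_k$ from the relation $(I-\lambda\gamma P_k)w_k = r_k + (1-\lambda)\gamma P_kv_{k-1}$ used in the derivation above; the function names are illustrative only.

```python
def inv2(m):
    """Inverse of a 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def matvec(m, x):
    return [sum(m[i][j] * x[j] for j in range(2)) for i in range(2)]


def resolvent(p, c):
    """(I - c P)^{-1}."""
    return inv2([[(1.0 if i == j else 0.0) - c * p[i][j] for j in range(2)]
                 for i in range(2)])


def shift_and_prediction(v_prev, lam, gamma=0.9):
    """Return (s_k, beta (I - gamma P_k)^{-1} A_k (-b_{k-1})) for the exact algorithm."""
    reward = [0.0, 1.0]
    nxt = [max((s, 1 - s), key=lambda t: v_prev[t]) for s in range(2)]  # greedy pi_k
    p = [[1.0 if nxt[i] == j else 0.0 for j in range(2)] for i in range(2)]
    # w_k solves (I - lam gamma P_k) w_k = r_k + (1 - lam) gamma P_k v_{k-1}
    w = matvec(resolvent(p, lam * gamma),
               [reward[s] + (1 - lam) * gamma * v_prev[nxt[s]] for s in range(2)])
    v_pi = matvec(resolvent(p, gamma), reward)
    shift = [a - b for a, b in zip(w, v_pi)]
    minus_b = [v_prev[s] - (reward[s] + gamma * v_prev[nxt[s]]) for s in range(2)]
    beta = (1 - lam) * gamma / (1 - lam * gamma)
    a_k = [[(1 - lam * gamma) * entry for entry in row]
           for row in matmul(p, resolvent(p, lam * gamma))]
    predicted = [beta * x for x in matvec(matmul(resolvent(p, gamma), a_k), minus_b)]
    return shift, predicted


print(shift_and_prediction([1.0, 0.0], lam=0.5))
```

For `v_prev = [1.0, 0.0]` and `lam = 0.5`, both vectors equal (β, β) with β = 0.45/0.55.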

B.3 Proof of Lemma B.2: a lower bound of the Bellman residual

From the definition of the algorithm, and using the fact that $T_kv^{\pi_k} = v^{\pi_k}$, we see that:

$$b_k = T_{k+1}v_k - v_k$$

$$= T_{k+1}v_k - T_kv_k + T_kv_k - v_k$$

$$\geq T_kv_k - v_k$$

$$= T_kv_k - T_kv^{\pi_k} + v^{\pi_k} - v_k$$

$$= \gamma P_k(v_k - v^{\pi_k}) + v^{\pi_k} - v_k$$

$$= (\gamma P_k - I)(s_k + \epsilon_k)$$

$$= \beta A_kb_{k-1} + (\gamma P_k - I)\epsilon_k \quad (20)$$

where we eventually used the relation between $s_k$ and $b_{k-1}$ (Lemma B.1). In other words:

$$b_{k+1} \geq \beta A_{k+1}b_k + x_{k+1}$$

with

$$x_k := (\gamma P_k - I)\epsilon_k.$$

Since $A_k$ is a stochastic matrix and $\beta \geq 0$, we get by induction:

$$b_k \geq \sum_{j=k_0+1}^{k}\beta^{k-j}(A_kA_{k-1}\cdots A_{j+1})x_j + \beta^{k-k_0}(A_kA_{k-1}\cdots A_{k_0+1})b_{k_0} =: \underline{b}_k.$$

B.4 Proof of Lemma B.3: an upper bound of the distance



Given that $T_*v_* = v_*$, we have

$$v_* = v_* + (I-\lambda\gamma P_{k+1})^{-1}(T_*v_* - v_*).$$

Therefore the distance satisfies:

$$d_{k+1} = v_* - w_{k+1}$$

$$= (I-\lambda\gamma P_{k+1})^{-1}\left[(T_*v_* - \lambda\gamma P_{k+1}v_*) - (T_{k+1}v_k - \lambda\gamma P_{k+1}v_k)\right]$$

$$= (I-\lambda\gamma P_{k+1})^{-1}\left[T_*v_* - T_{k+1}v_k + \lambda\gamma P_{k+1}(v_k - v_*)\right]$$

$$= \lambda\gamma P_{k+1}d_{k+1} + T_*v_* - T_{k+1}v_k + \lambda\gamma P_{k+1}(v_k - v_*)$$

$$= \lambda\gamma P_{k+1}d_{k+1} + T_*v_* - T_{k+1}v_k + \lambda\gamma P_{k+1}(w_k + \epsilon_k - v_*)$$

$$= \lambda\gamma P_{k+1}d_{k+1} + T_*v_* - T_{k+1}v_k + \lambda\gamma P_{k+1}(\epsilon_k - d_k)$$

$$= T_*v_* - T_{k+1}v_k + \lambda\gamma P_{k+1}(\epsilon_k + d_{k+1} - d_k).$$

Since $\pi_{k+1}$ is greedy with respect to $v_k$, we have $T_{k+1}v_k \geq T_*v_k$ and therefore:

$$T_*v_* - T_{k+1}v_k = T_*v_* - T_*v_k + T_*v_k - T_{k+1}v_k$$

$$\leq T_*v_* - T_*v_k$$

$$= \gamma P_*(v_* - v_k)$$

$$= \gamma P_*(v_* - (w_k + \epsilon_k))$$

$$= \gamma P_*d_k - \gamma P_*\epsilon_k.$$

As a consequence, the distance satisfies:

$$d_{k+1} \leq \gamma P_*d_k + \lambda\gamma P_{k+1}(\epsilon_k + d_{k+1} - d_k) - \gamma P_*\epsilon_k.$$

Noticing that:

$$\epsilon_k + d_{k+1} - d_k = \epsilon_k + w_k - w_{k+1}$$

$$= v_k - w_{k+1}$$

$$= -(I-\lambda\gamma P_{k+1})^{-1}(T_{k+1}v_k - v_k)$$

$$= (I-\lambda\gamma P_{k+1})^{-1}(-b_k)$$

$$\leq (I-\lambda\gamma P_{k+1})^{-1}(-\underline{b}_k)$$

we get:

$$d_{k+1} \leq \gamma P_*d_k + y_k$$

where

$$y_k := \frac{\lambda\gamma}{1-\lambda\gamma}A_{k+1}(-\underline{b}_k) - \gamma P_*\epsilon_k.$$

Since $P_*$ is a stochastic matrix and $\gamma \geq 0$, we have by induction:

$$d_k \leq \sum_{j=k_0}^{k-1}\gamma^{k-1-j}(P_*)^{k-1-j}y_j + \gamma^{k-k_0}(P_*)^{k-k_0}d_{k_0} =: \overline{d}_k.$$

C Componentwise rate of convergence for Exact λ Policy Iteration

We here derive the convergence rate bounds for Exact A Policy Iteration (as expressed in Proposition 14.51 
page [21]). We rely on the loss bound analysis of Appendix [B] with e k = 0. In this specific case, we know that 
the loss Ik < d k + Sfc where 

-h = p k - k «A k A k _ 1 ...A ko+1 {-b ko ) 
\ k ~ 1 

j=ko 

sj: = f3(I-" / P k )- 1 A k (-b t± ) 

We therefore have: 

. fc-i 

Tk = T~X^ S 7 fe - 1 ^> i - feo (n) fe - 1 - J 'A J - +1J 4 J ,..A )to+1 (-6 fco ) + 7 ' ! - fe °(P*) fc - fe °4 



j=k 



and 

Therefore: 
with 



Sfe = I3 k ~ k0 (l - 1 P k )~ 1 A k A k ^ 1 ...A ko+1 {-b kn ) 

h < (jZ^) E' kk0 (-bk )+7 k - k0 (P*) k - k0 d k0 (21) 



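Lemma C.1. $E'_{kk_0}$ is a stochastic matrix.

Proof.

$$\begin{aligned}
\|E'_{kk_0}\| &= \frac{1-\gamma}{\gamma^{k-k_0}} \left[ \frac{\lambda\gamma}{1-\lambda\gamma} \sum_{j=k_0}^{k-1} \gamma^{k-1-j} \beta^{j-k_0} + \frac{\beta^{k-k_0}}{1-\gamma} \right] \\
&= \frac{1-\gamma}{\gamma^{k-k_0}} \left[ \frac{\lambda\gamma}{1-\lambda\gamma} \cdot \frac{\gamma^{k-k_0} - \beta^{k-k_0}}{\gamma - \beta} + \frac{\beta^{k-k_0}}{1-\gamma} \right] \\
&= \frac{1-\gamma}{\gamma^{k-k_0}} \left[ \frac{\gamma^{k-k_0} - \beta^{k-k_0}}{1-\gamma} + \frac{\beta^{k-k_0}}{1-\gamma} \right] \\
&= 1
\end{aligned}$$

where we used the facts that $\frac{\lambda\gamma}{1-\lambda\gamma} = \frac{\gamma-\beta}{1-\gamma}$ and $(1-\beta)(1-\lambda\gamma) = 1-\gamma$. □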
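The scalar identities behind this proof are easy to check numerically. The following Python sketch is only an illustration: the helper names `beta` and `norm_e_prime` and the sample values of $\lambda$ and $\gamma$ are assumptions made here, not part of the original analysis. It evaluates the row sum of $E'_{kk_0}$ from its scalar weights, writing $n = k - k_0$.

```python
def beta(lam, gamma):
    """beta = (1 - lambda) * gamma / (1 - lambda * gamma)."""
    return (1 - lam) * gamma / (1 - lam * gamma)


def norm_e_prime(lam, gamma, n):
    """Row sum of E'_{k k0} for n = k - k0, using only the scalar weights."""
    b = beta(lam, gamma)
    # sum_{j=k0}^{k-1} gamma^{k-1-j} beta^{j-k0}, re-indexed with i = j - k0
    geometric = sum(gamma ** (n - 1 - i) * b ** i for i in range(n))
    total = lam * gamma / (1 - lam * gamma) * geometric + b ** n / (1 - gamma)
    return (1 - gamma) / gamma ** n * total


# Illustrative (lambda, gamma) pairs, including both extremes of lambda.
samples = [(0.0, 0.9), (0.5, 0.95), (1.0, 0.8)]
checks = []
for lam, gamma in samples:
    b = beta(lam, gamma)
    identity_1 = lam * gamma / (1 - lam * gamma) - (gamma - b) / (1 - gamma)
    identity_2 = (1 - b) * (1 - lam * gamma) - (1 - gamma)
    checks.append((identity_1, identity_2, norm_e_prime(lam, gamma, 7)))
```

Each entry of `checks` holds the two identity residuals (which should vanish up to rounding) and the row sum (which should equal 1), for $\lambda = 0$ (Value Iteration), an intermediate $\lambda$, and $\lambda = 1$ (Policy Iteration).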

C.1 A bound with respect to the Bellman residual

We first need the following lemma:

Lemma C.2. The bias and the distance are related as follows:

$$b_k \ge (I - \gamma P_*)\, d_k.$$
Proof. Since $\pi_{k+1}$ is greedy with respect to $v_k$, $T_{k+1} v_k \ge T_* v_k$ and

$$\begin{aligned}
b_k &= T_{k+1} v_k - v_k \\
&= T_{k+1} v_k - T_* v_k + T_* v_k - T_* v_* + v_* - v_k \\
&\ge \gamma P_* (v_k - v_*) + v_* - v_k \\
&= (I - \gamma P_*)\, d_k. \qquad \square
\end{aligned}$$
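Lemma C.2 can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions: the 2-state, 2-action MDP (`R`, `P`), the discount `GAMMA` and all function names are hypothetical and chosen here only for the example. In the exact case $\epsilon_k = 0$, so $d_k = v_* - v_k$; $v_*$ is approximated by running Value Iteration for many sweeps.

```python
GAMMA = 0.9
# Hypothetical MDP: R[a][s] is the reward, P[a][s][t] the transition probability.
R = [[1.0, 0.0], [0.0, 2.0]]
P = [[[0.9, 0.1], [0.5, 0.5]],
     [[0.2, 0.8], [0.1, 0.9]]]


def q_values(v, s):
    """One-step lookahead values of every action in state s."""
    return [R[a][s] + GAMMA * sum(P[a][s][t] * v[t] for t in range(2))
            for a in range(2)]


def bellman_opt(v):
    """Optimal Bellman operator; equals T_{k+1} v when pi_{k+1} is greedy w.r.t. v."""
    return [max(q_values(v, s)) for s in range(2)]


def greedy(v):
    """A greedy policy with respect to v (first maximizer in each state)."""
    return [max(range(2), key=lambda a: q_values(v, s)[a]) for s in range(2)]


# Approximate v_* by Value Iteration, then read off an optimal policy.
v_star = [0.0, 0.0]
for _ in range(2000):
    v_star = bellman_opt(v_star)
pi_star = greedy(v_star)


def residual_gap(v):
    """Componentwise b - (I - gamma P_*) d, which Lemma C.2 says is nonnegative."""
    b = [tv - x for tv, x in zip(bellman_opt(v), v)]
    d = [vs - x for vs, x in zip(v_star, v)]
    rhs = [d[s] - GAMMA * sum(P[pi_star[s]][s][t] * d[t] for t in range(2))
           for s in range(2)]
    return [b[s] - rhs[s] for s in range(2)]
```

Calling `residual_gap` on any value vector, for instance `residual_gap([10.0, -3.0])`, should return nonnegative components up to rounding error.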



We thus have:

$$d_{k_0} \le (I - \gamma P_*)^{-1} b_{k_0}.$$

Then equation (21) becomes

$$l_k \le \frac{\gamma^{k-k_0}}{1-\gamma} \left[ E_{kk_0} - E'_{kk_0} \right] b_{k_0}$$

where:

$$E_{kk_0} := (1-\gamma)(P_*)^{k-k_0}(I - \gamma P_*)^{-1}$$

is a stochastic matrix.



C.2 A bound with respect to the distance 

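From Lemma C.2 we know that

$$-b_{k_0} \le (I - \gamma P_*)(-d_{k_0}).$$

Then equation (21) becomes

$$\begin{aligned}
l_k &\le \frac{\gamma^{k-k_0}}{1-\gamma} E'_{kk_0} (I - \gamma P_*)(-d_{k_0}) + \gamma^{k-k_0} (P_*)^{k-k_0} d_{k_0} \\
&= \frac{\gamma^{k-k_0}}{1-\gamma} \left[ F_{kk_0} - E'_{kk_0} \right] d_{k_0}
\end{aligned}$$

where

$$F_{kk_0} := (1-\gamma)(P_*)^{k-k_0} + \gamma E'_{kk_0} P_*$$

is a stochastic matrix.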



C.3 A bound with respect to the distance and the loss of the greedy policy 

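Let $K$ be a constant and define $\tilde{v}_{k_0} := v_{k_0} - Ke$, with Bellman residual $\tilde{b}_{k_0} := T_{k_0+1} \tilde{v}_{k_0} - \tilde{v}_{k_0}$. The following statements are equivalent:

$$\begin{aligned}
\tilde{b}_{k_0} &\ge 0 \\
T_{k_0+1} \tilde{v}_{k_0} &\ge \tilde{v}_{k_0} \\
r_{k_0+1} + \gamma P_{k_0+1}(v_{k_0} - Ke) &\ge v_{k_0} - Ke \\
(I - \gamma P_{k_0+1}) Ke &\ge -r_{k_0+1} + (I - \gamma P_{k_0+1}) v_{k_0} \\
Ke &\ge (I - \gamma P_{k_0+1})^{-1}(-r_{k_0+1}) + v_{k_0} \\
Ke &\ge v_{k_0} - v^{\pi_{k_0+1}}
\end{aligned}$$

The minimal $K$ for which $\tilde{b}_{k_0} \ge 0$ is thus $K := \max_s [v_{k_0}(s) - v^{\pi_{k_0+1}}(s)]$. As $\tilde{v}_{k_0}$ and $v_{k_0}$ only differ by a constant vector, they will generate the same sequence of policies $\pi_{k_0+1}, \pi_{k_0+2}, \ldots$ Then, as $\tilde{b}_{k_0} \ge 0$, equation (21) tells us that

$$\begin{aligned}
l_k &\le \gamma^{k-k_0} (P_*)^{k-k_0} (v_* - \tilde{v}_{k_0}) \\
&= \gamma^{k-k_0} (P_*)^{k-k_0} (v_* - v_{k_0} + Ke).
\end{aligned}$$

Now notice that

$$\begin{aligned}
K &= \max_s \left[ v_{k_0}(s) - v_*(s) + v_*(s) - v^{\pi_{k_0+1}}(s) \right] \\
&\le \max_s \left[ v_{k_0}(s) - v_*(s) \right] + \max_s \left[ v_*(s) - v^{\pi_{k_0+1}}(s) \right] \\
&= -\min_s \left[ v_*(s) - v_{k_0}(s) \right] + \| v_* - v^{\pi_{k_0+1}} \|_\infty.
\end{aligned}$$

Then, using the fact that $(P_*)^{k-k_0} e = e$, we get:

$$l_k \le \gamma^{k-k_0} \left[ (P_*)^{k-k_0} \left( (v_* - v_{k_0}) - \min_s \left[ v_*(s) - v_{k_0}(s) \right] e \right) + \| v_* - v^{\pi_{k_0+1}} \|_\infty\, e \right].$$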

D Asymptotic componentwise loss bounds with respect to the approximation error

We here use the loss bound analysis of Appendix B to derive an asymptotic analysis of approximate λ Policy Iteration with respect to the approximation error.

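D.1 General analysis

Since

$$l_k = d_k + s_k \le \bar{d}_k + \bar{s}_k, \qquad (22)$$

an upper bound of the loss can be derived from the upper bounds of the distance and the shift.
Let us first concentrate on the bound $\bar{d}_k$ of the distance. So far we have proved that:

$$\bar{d}_k = \sum_{i=k_0}^{k-1} \gamma^{k-1-i} (P_*)^{k-1-i}\, y_i + O(\gamma^{k-k_0}), \qquad y_i = \frac{\lambda\gamma}{1-\lambda\gamma} A_{i+1}(-\underline{b}_i) - \gamma P_* \epsilon_i.$$

Writing

$$-\underline{b}_i = \sum_{j=k_0}^{i} \beta^{i-j} (A_i A_{i-1} \cdots A_{j+1})(-x_j) + O(\gamma^{i-k_0})$$

(the $j = k_0$ term is itself $O(\gamma^{i-k_0})$, so including it in the sum does not change the expansion), recalling that $-x_j = (I - \gamma P_j)\epsilon_j$, and putting all things together, we see that:

$$\begin{aligned}
\bar{d}_k &= \frac{\lambda\gamma}{1-\lambda\gamma} \sum_{i=k_0}^{k-1} \sum_{j=k_0}^{i} \gamma^{k-1-i} \beta^{i-j} (P_*)^{k-1-i} A_{i+1} A_i \cdots A_{j+1} (I - \gamma P_j)\epsilon_j - \sum_{i=k_0}^{k-1} \gamma^{k-i} (P_*)^{k-i} \epsilon_i + O(\gamma^{k-k_0}) \\
&= \frac{\lambda\gamma}{1-\lambda\gamma} \sum_{j=k_0}^{k-1} \sum_{i=j}^{k-1} \gamma^{k-1-i} \beta^{i-j} (P_*)^{k-1-i} A_{i+1} A_i \cdots A_{j+1} (I - \gamma P_j)\epsilon_j - \sum_{j=k_0}^{k-1} \gamma^{k-j} (P_*)^{k-j} \epsilon_j + O(\gamma^{k-k_0}) \\
&= \sum_{j=k_0}^{k-1} \left[ \frac{\lambda\gamma}{1-\lambda\gamma} \sum_{i=j}^{k-1} \gamma^{k-1-i} \beta^{i-j} (P_*)^{k-1-i} A_{i+1} A_i \cdots A_{j+1} (I - \gamma P_j) - \gamma^{k-j} (P_*)^{k-j} \right] \epsilon_j + O(\gamma^{k-k_0}). \qquad (23)
\end{aligned}$$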



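Let us now consider the bound $\bar{s}_k$ of the shift:

$$\begin{aligned}
\bar{s}_k &= \beta (I - \gamma P_k)^{-1} A_k (-\underline{b}_{k-1}) \\
&= \beta (I - \gamma P_k)^{-1} A_k \left[ \sum_{j=k_0}^{k-1} \beta^{k-1-j} (A_{k-1} A_{k-2} \cdots A_{j+1})(-x_j) + O(\gamma^{k-k_0}) \right] \\
&= \sum_{j=k_0}^{k-1} \frac{\beta^{k-j}}{1-\gamma}\, Y_{jk} (I - \gamma P_j)\epsilon_j + O(\gamma^{k-k_0}) \qquad (24)
\end{aligned}$$

with

$$Y_{jk} := (1-\gamma)(I - \gamma P_k)^{-1} A_k A_{k-1} \cdots A_{j+1}.$$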
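Eventually, from equations (22), (23) and (24) we get:

$$l_k \le \sum_{j=k_0}^{k-1} \left[ \frac{\lambda\gamma}{1-\lambda\gamma} \sum_{i=j}^{k-1} \gamma^{k-1-i} \beta^{i-j} (P_*)^{k-1-i} A_{i+1} A_i \cdots A_{j+1} (I - \gamma P_j) + \frac{\beta^{k-j}}{1-\gamma} Y_{jk} (I - \gamma P_j) - \gamma^{k-j} (P_*)^{k-j} \right] \epsilon_j + O(\gamma^{k-k_0}). \qquad (25)$$

Introduce the following matrices:

$$\begin{aligned}
B_{jk} &:= \frac{1-\gamma}{\gamma^{k-j}} \left[ \frac{\lambda\gamma}{1-\lambda\gamma} \sum_{i=j}^{k-1} \gamma^{k-1-i} \beta^{i-j} (P_*)^{k-1-i} A_{i+1} A_i \cdots A_{j+1} + \frac{\beta^{k-j}}{1-\gamma} Y_{jk} \right] \\
B'_{jk} &:= \gamma B_{jk} P_j + (1-\gamma)(P_*)^{k-j}
\end{aligned}$$

Lemma D.1. $B_{jk}$ and $B'_{jk}$ are stochastic matrices.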
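Proof. It is clear from the definition of $Y_{jk}$ and of the products $(P_*)^{k-1-i} A_{i+1} A_i \cdots A_{j+1}$ that normalizing $B_{jk}$ and $B'_{jk}$ will give stochastic matrices. So we just need to check that their norm is 1.

$$\begin{aligned}
\|B_{jk}\| &= \frac{1-\gamma}{\gamma^{k-j}} \left[ \frac{\lambda\gamma}{1-\lambda\gamma} \sum_{i=j}^{k-1} \gamma^{k-1-i} \beta^{i-j} + \frac{\beta^{k-j}}{1-\gamma} \right] \\
&= \frac{1-\gamma}{\gamma^{k-j}} \left[ \frac{\lambda\gamma}{1-\lambda\gamma} \cdot \frac{\gamma^{k-j} - \beta^{k-j}}{\gamma - \beta} + \frac{\beta^{k-j}}{1-\gamma} \right] \\
&= \frac{1-\gamma}{\gamma^{k-j}} \left[ \frac{\gamma^{k-j} - \beta^{k-j}}{(1-\lambda\gamma)(1-\beta)} + \frac{\beta^{k-j}}{1-\gamma} \right] \\
&= \frac{1-\gamma}{\gamma^{k-j}} \cdot \frac{\gamma^{k-j}}{1-\gamma} \\
&= 1
\end{aligned}$$

where we used the identities $\frac{\lambda\gamma}{1-\lambda\gamma} = \frac{\gamma-\beta}{1-\gamma}$ and $(1-\beta)(1-\lambda\gamma) = 1-\gamma$. Then it is also clear that $\|B'_{jk}\| = \gamma + (1-\gamma) = 1$. □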
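Equation (25) can be rewritten as follows:

$$\begin{aligned}
l_k &\le \sum_{j=k_0}^{k-1} \left[ \frac{\gamma^{k-j}}{1-\gamma} B_{jk} (I - \gamma P_j) - \gamma^{k-j} (P_*)^{k-j} \right] \epsilon_j + O(\gamma^{k-k_0}) \\
&= \frac{1}{1-\gamma} \sum_{j=k_0}^{k-1} \gamma^{k-j} \left[ B_{jk} - B'_{jk} \right] \epsilon_j + O(\gamma^{k-k_0}).
\end{aligned}$$

Taking the supremum limit, we see that for all $k_0$,

$$\limsup_{k \to \infty} l_k \le \frac{1}{1-\gamma} \limsup_{k \to \infty} \sum_{j=k_0}^{k-1} \gamma^{k-j} \left[ B_{jk} - B'_{jk} \right] \epsilon_j. \qquad (26)$$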



D.2 When the value converges 

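Suppose λ Policy Iteration converges to some value $v$. Let policy $\pi$ be the corresponding greedy policy, with stochastic matrix $P$. Let $b$ be the Bellman residual of $v$. It is also clear that the approximation error converges to some $\epsilon$. Indeed, from Algorithm 3 and the relations $w_{k+1} = v_k + (I - \lambda\gamma P_{k+1})^{-1}(T_{k+1} v_k - v_k)$ and $v_{k+1} = w_{k+1} + \epsilon_{k+1}$, we get:

$$b = Tv - v = (I - \lambda\gamma P)(-\epsilon).$$

From the bound with respect to the Bellman residual (equation (19), page 27), we can see that:

$$\begin{aligned}
v_* - v^{\pi} &\le \left[ (I - \gamma P_*)^{-1} - (I - \gamma P)^{-1} \right] b \\
&= \left[ (I - \gamma P)^{-1} - (I - \gamma P_*)^{-1} \right] (I - \lambda\gamma P)\, \epsilon \\
&= \left[ (I - \gamma P)^{-1}(I - \lambda\gamma P) - (I - \gamma P_*)^{-1}(I - \lambda\gamma P) \right] \epsilon \\
&= \left[ (I - \gamma P)^{-1}(I - \gamma P + \gamma P - \lambda\gamma P) - (I - \gamma P_*)^{-1}(I - \lambda\gamma P) \right] \epsilon \\
&= \left[ \left( I + (1-\lambda)(I - \gamma P)^{-1}\gamma P + \lambda (I - \gamma P_*)^{-1}\gamma P \right) - (I - \gamma P_*)^{-1} \right] \epsilon \\
&= \left[ \left( (1-\lambda)(I - \gamma P)^{-1}\gamma P + \lambda (I - \gamma P_*)^{-1}\gamma P \right) - (I - \gamma P_*)^{-1}\gamma P_* \right] \epsilon \\
&= \frac{\gamma}{1-\gamma} \left[ B_v - D \right] \epsilon
\end{aligned}$$

where

$$\begin{aligned}
B_v &:= (1-\gamma)\left( (1-\lambda)(I - \gamma P)^{-1} P + \lambda (I - \gamma P_*)^{-1} P \right) \\
D &:= (1-\gamma) P_* (I - \gamma P_*)^{-1}.
\end{aligned}$$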



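Lemma D.2. $B_v$ and $D$ are stochastic matrices.

Proof. It is clear that $\|D\| = 1$. Also:

$$\begin{aligned}
\|B_v\| &= (1-\gamma)\left( \frac{1-\lambda}{1-\gamma} + \frac{\lambda}{1-\gamma} \right) \\
&= (1-\gamma) \cdot \frac{1}{1-\gamma} \\
&= 1. \qquad \square
\end{aligned}$$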
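The algebra of this subsection can be sanity-checked on small matrices. The following Python sketch is only an illustration: the two $2 \times 2$ stochastic matrices `P` and `P_STAR`, the values of `GAMMA` and `LAM`, and the helper names are assumptions made here. It compares $\left[(I-\gamma P)^{-1} - (I-\gamma P_*)^{-1}\right](I-\lambda\gamma P)$ with $\frac{\gamma}{1-\gamma}[B_v - D]$ and exposes the row sums of $B_v$ and $D$.

```python
def matmul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def lin(a, ca, b, cb):
    """Linear combination ca * a + cb * b of two 2x2 matrices."""
    return [[ca * a[i][j] + cb * b[i][j] for j in range(2)] for i in range(2)]


def inv(a):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


I2 = [[1.0, 0.0], [0.0, 1.0]]
GAMMA, LAM = 0.9, 0.5
P = [[0.7, 0.3], [0.2, 0.8]]        # hypothetical matrix of the limit policy
P_STAR = [[0.1, 0.9], [0.6, 0.4]]   # hypothetical matrix of the optimal policy

res = inv(lin(I2, 1.0, P, -GAMMA))            # (I - gamma P)^{-1}
res_star = inv(lin(I2, 1.0, P_STAR, -GAMMA))  # (I - gamma P_*)^{-1}
m = lin(I2, 1.0, P, -LAM * GAMMA)             # I - lambda gamma P

# Left-hand side: [(I - gamma P)^{-1} - (I - gamma P_*)^{-1}] (I - lambda gamma P)
lhs = lin(matmul(res, m), 1.0, matmul(res_star, m), -1.0)

# B_v and D as defined in the text, then gamma / (1 - gamma) * (B_v - D)
b_v = lin(matmul(res, P), (1 - GAMMA) * (1 - LAM),
          matmul(res_star, P), (1 - GAMMA) * LAM)
d = lin(matmul(P_STAR, res_star), 1 - GAMMA, I2, 0.0)
rhs = lin(b_v, GAMMA / (1 - GAMMA), d, -GAMMA / (1 - GAMMA))

row_sums_b_v = [sum(row) for row in b_v]
row_sums_d = [sum(row) for row in d]
```

The entries of `lhs` and `rhs` should agree up to rounding, and each entry of `row_sums_b_v` and `row_sums_d` should be 1, in line with Lemma D.2.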



D.3 When the policy converges 

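Suppose λ Policy Iteration converges to some policy $\pi$. Write $P$ the corresponding stochastic matrix and

$$A^{\pi} := (1 - \lambda\gamma) P (I - \lambda\gamma P)^{-1}.$$

Then for some big enough $k_0$, we have:

$$l_k \le \sum_{j=k_0}^{k-1} \left[ \frac{\gamma^{k-j}}{1-\gamma} A^{\pi}_{jk} A^{\pi} (I - \gamma P) - \gamma^{k-j} (P_*)^{k-j} \right] \epsilon_j + O(\gamma^{k-k_0})$$

where

$$A^{\pi}_{jk} := \frac{1-\gamma}{\gamma^{k-j}} \left[ \frac{\lambda\gamma}{1-\lambda\gamma} \sum_{i=j}^{k-1} \gamma^{k-1-i} \beta^{i-j} (P_*)^{k-1-i} (A^{\pi})^{i-j} + \beta^{k-j} (I - \gamma P)^{-1} (A^{\pi})^{k-j-1} \right]$$

is a stochastic matrix (for the same reasons why $B_{jk}$ is a stochastic matrix in Lemma D.1). Noticing that

$$\begin{aligned}
A^{\pi}(I - \gamma P) &= (1-\lambda\gamma) P (I - \lambda\gamma P)^{-1} (I - \gamma P) \\
&= (1-\lambda\gamma) P (I - \lambda\gamma P)^{-1} (I - \lambda\gamma P + \lambda\gamma P - \gamma P) \\
&= (1-\lambda\gamma) P \left( I - (1-\lambda)(I - \lambda\gamma P)^{-1} \gamma P \right) \\
&= (1-\lambda\gamma) P - \gamma(1-\lambda) A^{\pi} P
\end{aligned}$$

we can deduce that

$$\begin{aligned}
l_k &\le \sum_{j=k_0}^{k-1} \left[ \frac{\gamma^{k-j}}{1-\gamma} A^{\pi}_{jk} \left[ (1-\lambda\gamma) P - \gamma(1-\lambda) A^{\pi} P \right] - \gamma^{k-j} (P_*)^{k-j} \right] \epsilon_j + O(\gamma^{k-k_0}) \\
&= \sum_{j=k_0}^{k-1} \gamma^{k-j} \left[ \frac{1-\lambda\gamma}{1-\gamma} A^{\pi}_{jk} P - \left( \frac{\gamma(1-\lambda)}{1-\gamma} A^{\pi}_{jk} A^{\pi} P + (P_*)^{k-j} \right) \right] \epsilon_j + O(\gamma^{k-k_0}) \\
&= \frac{1-\lambda\gamma}{1-\gamma} \sum_{j=k_0}^{k-1} \gamma^{k-j} \left[ B^{\pi}_{jk} - {B'}^{\pi}_{jk} \right] \epsilon_j + O(\gamma^{k-k_0}) \qquad (27)
\end{aligned}$$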
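where

$$\begin{aligned}
B^{\pi}_{jk} &:= A^{\pi}_{jk} P \\
{B'}^{\pi}_{jk} &:= \frac{1-\gamma}{1-\lambda\gamma} \left[ \frac{\gamma(1-\lambda)}{1-\gamma} A^{\pi}_{jk} A^{\pi} P + (P_*)^{k-j} \right]
\end{aligned}$$

Lemma D.3. $B^{\pi}_{jk}$ and ${B'}^{\pi}_{jk}$ are stochastic matrices.

Proof. It is clear that $\|B^{\pi}_{jk}\| = 1$. Also:

$$\begin{aligned}
\|{B'}^{\pi}_{jk}\| &= \frac{1-\gamma}{1-\lambda\gamma} \left( \frac{\gamma(1-\lambda)}{1-\gamma} + 1 \right) \\
&= \frac{1-\gamma}{1-\lambda\gamma} \cdot \frac{1 - \gamma + \gamma - \lambda\gamma}{1-\gamma} \\
&= 1. \qquad \square
\end{aligned}$$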
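The matrix identity $A^{\pi}(I - \gamma P) = (1-\lambda\gamma)P - \gamma(1-\lambda)A^{\pi}P$ and the scalar computation of Lemma D.3 can be checked numerically. The Python sketch below is an illustration only: the matrix `P`, the parameter values and the helper names are assumptions made here.

```python
def matmul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def lin(a, ca, b, cb):
    """Linear combination ca * a + cb * b of two 2x2 matrices."""
    return [[ca * a[i][j] + cb * b[i][j] for j in range(2)] for i in range(2)]


def inv(a):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


I2 = [[1.0, 0.0], [0.0, 1.0]]
GAMMA, LAM = 0.9, 0.5
P = [[0.7, 0.3], [0.2, 0.8]]   # hypothetical matrix of the limit policy

# A^pi = (1 - lambda gamma) P (I - lambda gamma P)^{-1}
a_pi = lin(matmul(P, inv(lin(I2, 1.0, P, -LAM * GAMMA))), 1 - LAM * GAMMA, I2, 0.0)

# Both sides of A^pi (I - gamma P) = (1 - lambda gamma) P - gamma (1 - lambda) A^pi P
lhs = matmul(a_pi, lin(I2, 1.0, P, -GAMMA))
rhs = lin(P, 1 - LAM * GAMMA, matmul(a_pi, P), -GAMMA * (1 - LAM))

# Scalar norm of B'^pi_{jk} computed as in the proof of Lemma D.3
norm_b_prime = (1 - GAMMA) / (1 - LAM * GAMMA) * (GAMMA * (1 - LAM) / (1 - GAMMA) + 1)
row_sums_a_pi = [sum(row) for row in a_pi]
```

Here `lhs` and `rhs` should coincide up to rounding, `row_sums_a_pi` should contain ones (so $A^{\pi}$ is stochastic), and `norm_b_prime` should equal 1.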



E From componentwise bounds to span seminorm bounds 



This Appendix contains the proofs of Lemmas 4.1, 4.2, 4.3 and 4.4, which enable us to derive span seminorm performance bounds from the componentwise analysis developed in the previous Appendices.



E.1 Proof of Lemma 4.1

Write $a_k := \arg\min_a \|y_k - ae\|_{p,\mu_k}$. As $X_k$ and $X'_k$ are stochastic matrices, $X_k e = X'_k e = e$, and we can write that:

$$\limsup_{k \to \infty} |x_k| \le K \limsup_{k \to \infty} (X_k - X'_k)(y_k - a_k e).$$

By taking the absolute value componentwise we get

$$\limsup_{k \to \infty} |x_k| \le K \limsup_{k \to \infty} (X_k + X'_k)\,|y_k - a_k e|.$$

It can then be seen that

$$\begin{aligned}
\limsup_{k \to \infty} \left( \|x_k\|_{p,\mu} \right)^p &= \limsup_{k \to \infty} \mu \left[ (|x_k|)^p \right] \\
&\le K^p \limsup_{k \to \infty} \mu \left[ \left( \tfrac{1}{2}(X_k + X'_k)\, 2|y_k - a_k e| \right)^p \right] \\
&\le K^p \limsup_{k \to \infty} \mu \left[ \tfrac{1}{2}(X_k + X'_k) \left( 2|y_k - a_k e| \right)^p \right] \\
&= K^p \limsup_{k \to \infty} \mu_k \left[ \left( 2|y_k - a_k e| \right)^p \right] \\
&= K^p \limsup_{k \to \infty} \left( 2\|y_k - a_k e\|_{p,\mu_k} \right)^p \\
&= K^p \limsup_{k \to \infty} \left( \mathrm{span}_{p,\mu_k}[y_k] \right)^p
\end{aligned}$$

where we used Jensen's inequality (using the convexity of $x \mapsto x^p$) and where the last equality results from the definition of $a_k$.
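To make the last step concrete, here is a small Python sketch of the span seminorm for $p = 2$. It assumes the definition $\mathrm{span}_{p,\mu}[y] := 2 \min_a \|y - ae\|_{p,\mu}$ with $\|x\|_{p,\mu} := \left( \sum_s \mu(s)|x(s)|^p \right)^{1/p}$, as suggested by the final equality of the proof; the function names and the sample vectors are hypothetical. For $p = 2$ the minimizing constant $a$ is the $\mu$-weighted mean of $y$.

```python
def weighted_norm(x, mu, p):
    """Weighted L_p norm: (sum_s mu(s) |x(s)|^p)^(1/p)."""
    return sum(m * abs(v) ** p for m, v in zip(mu, x)) ** (1.0 / p)


def span_seminorm_p2(y, mu):
    """Span seminorm for p = 2 together with the minimizing constant a.

    For p = 2, a -> ||y - a e||_{2,mu} is minimized at the mu-weighted mean.
    """
    a = sum(m * v for m, v in zip(mu, y))
    return 2.0 * weighted_norm([v - a for v in y], mu, 2), a


mu = [0.5, 0.5]                      # hypothetical distribution over two states
y = [1.0, 3.0]
span_y, a_y = span_seminorm_p2(y, mu)

# Adding a constant vector to y leaves the span seminorm unchanged.
span_shifted, _ = span_seminorm_p2([v + 10.0 for v in y], mu)

# Any other constant a gives a norm at least as large as the minimizer's.
other = 2.0 * weighted_norm([v - 1.5 for v in y], mu, 2)
```

The invariance under constant shifts is what lets the proof subtract $a_k e$ freely: since $X_k e = X'_k e = e$, the difference $(X_k - X'_k)$ annihilates constant vectors.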
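E.2 Proof of Lemma 4.2

Write $a_{kj} := \arg\min_a \|y_j - ae\|_{p,\mu_{kj}}$. As $X_{kj}$ and $X'_{kj}$ are stochastic matrices, $X_{kj} e = X'_{kj} e = e$ and we can write that:

$$\limsup_{k \to \infty} |x_k| \le K \limsup_{k \to \infty} \sum_{j=k_0}^{k-1} \gamma^{k-j} (X_{kj} - X'_{kj})(y_j - a_{kj} e).$$

By taking the absolute value we get

$$\limsup_{k \to \infty} |x_k| \le K \limsup_{k \to \infty} \sum_{j=k_0}^{k-1} \gamma^{k-j} (X_{kj} + X'_{kj})\,|y_j - a_{kj} e|.$$

It can then be seen that

$$\begin{aligned}
\limsup_{k \to \infty} \left( \|x_k\|_{p,\mu} \right)^p &= \limsup_{k \to \infty} \mu \left[ (|x_k|)^p \right] \\
&\le K^p \limsup_{k \to \infty} \mu \left[ \left( \sum_{j=k_0}^{k-1} \gamma^{k-j} (X_{kj} + X'_{kj})\,|y_j - a_{kj} e| \right)^p \right] \\
&= K^p \limsup_{k \to \infty} \mu \left[ \left( \sum_{j=k_0}^{k-1} \gamma^{k-j}\, \tfrac{1}{2}(X_{kj} + X'_{kj})\, 2|y_j - a_{kj} e| \right)^p \right] \\
&\le K^p \left( \frac{\gamma}{1-\gamma} \right)^{p-1} \limsup_{k \to \infty} \mu \left[ \sum_{j=k_0}^{k-1} \gamma^{k-j}\, \tfrac{1}{2}(X_{kj} + X'_{kj}) \left( 2|y_j - a_{kj} e| \right)^p \right] \\
&= K^p \left( \frac{\gamma}{1-\gamma} \right)^{p-1} \limsup_{k \to \infty} \sum_{j=k_0}^{k-1} \gamma^{k-j}\, \mu_{kj} \left[ \left( 2|y_j - a_{kj} e| \right)^p \right] \\
&= K^p \left( \frac{\gamma}{1-\gamma} \right)^{p-1} \limsup_{k \to \infty} \sum_{j=k_0}^{k-1} \gamma^{k-j} \left( 2\|y_j - a_{kj} e\|_{p,\mu_{kj}} \right)^p \\
&= K^p \left( \frac{\gamma}{1-\gamma} \right)^{p-1} \limsup_{k \to \infty} \sum_{j=k_0}^{k-1} \gamma^{k-j} \left( \mathrm{span}_{p,\mu_{kj}}[y_j] \right)^p \\
&\le K^p \left( \frac{\gamma}{1-\gamma} \right)^{p-1} \limsup_{k \to \infty} \sum_{j=k_0}^{k-1} \gamma^{k-j} \left( \sup_{k' > j' \ge k_0} \mathrm{span}_{p,\mu_{k'j'}}[y_{j'}] \right)^p \\
&\le K^p \left( \frac{\gamma}{1-\gamma} \right)^{p} \left( \sup_{k' > j' \ge k_0} \mathrm{span}_{p,\mu_{k'j'}}[y_{j'}] \right)^p
\end{aligned}$$

where $x^p$ means the componentwise power of vector $x$, and where we used Jensen's inequality (with the convex function $x \mapsto x^p$) and the fact that $\sum_{j=k_0}^{k-1} \gamma^{k-j} \le \frac{\gamma}{1-\gamma}$. As this is true for all $k_0$, and as $k_0 \mapsto \sup_{k' > j' \ge k_0} \mathrm{span}_{p,\mu_{k'j'}}[y_{j'}]$ is non-increasing, the result follows.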






References 

[1] D. Bertsekas and S. Ioffe. Temporal differences-based policy iteration and applications in neuro-dynamic 
programming. Technical Report LIDS-P-2349, MIT, 1996. 

[2] D.P. Bertsekas and J.N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.

[3] G.J. Gordon. Stable function approximation in dynamic programming. In Armand Prieditis and Stuart 
Russell, editors, Proceedings of the Twelfth International Conference on Machine Learning, pages 261- 
268, San Francisco, CA, 1995. Morgan Kaufmann. 

[4] R. Munos. Error bounds for approximate policy iteration. In Proceedings of the International Conference 
on Machine Learning, pages 560-567, 2003. 

[5] R. Munos. Performance bounds in Lp norm for Approximate Value Iteration. SIAM Journal on Control 
and Optimization, 2007. To appear. 

[6] R. Munos and C. Szepesvari. Finite time bounds for sampling based fitted value iteration. Journal of 
Machine Learning Research, To appear. 

[7] M. Puterman. Markov Decision Processes. Wiley, New York, 1994. 

[8] S. Singh and P. Dayan. Analytical mean squared error curves for temporal difference learning. Machine
Learning Journal, 32(1):5-40, 1998.

[9] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. Bradford Book. The MIT Press,
1998.

[10] R. Williams and L. Baird. Tight performance bounds on greedy policies based on imperfect value 
functions, 1993. 



RR n° 6348 




Unite de recherche INRIA Lorraine
LORIA, Technopole de Nancy-Brabois - Campus scientifique 
615, rue du Jardin Botanique - BP 101 - 54602 Villers-les-Nancy Cedex (France) 

Unite de recherche INRIA Futurs : Parc Club Orsay Universite - ZAC des Vignes
4, rue Jacques Monod - 91893 ORSAY Cedex (France) 
Unite de recherche INRIA Rennes : IRISA, Campus universitaire de Beaulieu - 35042 Rennes Cedex (France) 
Unite de recherche INRIA Rhone-Alpes : 655, avenue de l'Europe - 38334 Montbonnot Saint-Ismier (France) 
Unite de recherche INRIA Rocquencourt : Domaine de Voluceau - Rocquencourt - BP 105 - 78153 Le Chesnay Cedex (France) 
Unite de recherche INRIA Sophia Antipolis : 2004, route des Lucioles - BP 93 - 06902 Sophia Antipolis Cedex (France) 



Editeur 

INRIA - Domaine de Voluceau - Rocquencourt, BP 105 - 78153 Le Chesnay Cedex (France) 

http://www.inria.fr 
ISSN 0249-6399 



